9. Affiliation processing

9.1. Affiliation Name

The terms organisation and affiliation are used interchangeably throughout the TIM documentation. Each document carries two original affiliation fields: emm_affiliation__name and emm_affiliation__nameVariant. Their content depends on the source that provides them, but special care has been taken to make them as consistent as possible.

  • Scopus
    • emm_affiliation__nameVariant is the full affiliation name line

    • emm_affiliation__name is the main subfield of emm_affiliation__nameVariant (department names etc. are disregarded).

  • Other datasources
    • emm_affiliation__nameVariant is one or more name variants of the organisation.

    • emm_affiliation__name is the main affiliation field provided.

Example

Source   Field                         Value
Scopus   emm_affiliation__nameVariant  Dept. of Small Anim. Clin. Sciences,College of Veterinary Medicine,University of Florida
Scopus   emm_affiliation__name         University of Florida
Patstat  emm_affiliation__nameVariant  ITM POWER RESEARCH LTD;ITM POWER (RESEARCH);ITM POWER (RESEARCH) LIMITED
Patstat  emm_affiliation__name         ITM POWER RESEARCH LTD

Organisation names need further processing in TIM. The main reason is that they must be harmonized across several databases: Scopus, Patstat, Cordis and all the other databases inserted into TIM each hold affiliation records in their own format and style. Another important reason is that, even in the original data, there are often duplicates and mistakes. In addition, a common name sometimes has to be chosen irrespective of the locality of the affiliation or its place in a group hierarchy (is it a subsidiary? has it been acquired by another company? is it an umbrella organisation?).

Therefore, disambiguation algorithms are applied to the data in order to achieve a consistent denomination of the affiliations. The TIM module that is responsible for this is called the Entity Matcher. The Entity Matcher matches all the incoming affiliation names against its own internal database of affiliation variants, and if a match is found, it provides a new field called emm_affiliation__ename. On top of that, a series of location-related fields is delivered, which accompany the respective original fields existing in each document. In general, the fields provided by the Entity Matcher have the letter “e” as prefix before each attribute, so a field ending in _name becomes _ename, _city becomes _ecity and so on. This is illustrated in Fig. 9.1.

_images/entity_matcher.png

Fig. 9.1 The Entity Matcher mainly provides an extra field called emm_affiliation__ename to each doc, containing a disambiguated value for the affiliation name.
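
The matching step and the "e" naming convention described above can be sketched roughly as follows. This is only an illustration, not the Entity Matcher's actual algorithm: the variant table, the lower-casing used as normalisation, the disambiguated value "ITM Power" and the helper names e_field and match are all assumptions made for the example.

```python
# Hypothetical internal table: normalised name variant -> disambiguated name.
# The real Entity Matcher database and its matching rules are not shown here.
VARIANTS = {
    "itm power research ltd": "ITM Power",
    "itm power (research)": "ITM Power",
    "itm power (research) limited": "ITM Power",
}


def e_field(field: str) -> str:
    """Return the Entity Matcher counterpart of a field name,
    e.g. emm_affiliation__name -> emm_affiliation__ename."""
    prefix, sep, attribute = field.rpartition("__")
    return f"{prefix}{sep}e{attribute}"


def match(doc: dict) -> dict:
    """Return a copy of doc with emm_affiliation__ename added when the name is known."""
    out = dict(doc)
    # Assumed normalisation: trim whitespace and lower-case the original name.
    key = doc.get("emm_affiliation__name", "").strip().lower()
    ename = VARIANTS.get(key)
    if ename is not None:
        out[e_field("emm_affiliation__name")] = ename
    return out
```

With this sketch, a document whose emm_affiliation__name is ITM POWER RESEARCH LTD gains an emm_affiliation__ename field, while a document with an unknown name is returned unchanged.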

9.2. Affiliation Location

The location information is tightly linked to the affiliation name information. In the Entity Matcher internal database, each affiliation ename is linked to name variants kept from the various data sources, and each name variant is in turn linked to location information. The location that is most frequent among those variants is the one that the affiliation ename is going to be associated with. This is illustrated in Fig. 9.2.

_images/EntityMatcherExcerpt.png

Fig. 9.2 The Entity Matcher database contains variations on each affiliation, and location information for each of the variations. The most frequent location characterizes the affiliation.

Example

In Fig. 9.2, the most frequent location among the JRC variants in the Entity Matcher internal database is Ispra, Italy. The location information tied to JRC will therefore always be “Ispra, Italy”. Some of the location fields tied to this emm_affiliation__ename will be:

emm_affiliation__ecity: Ispra
emm_affiliation__ecountry: Italy

The original location fields will still be retained in each document. These might be, e.g.

emm_affiliation__city: Sevilla
emm_affiliation__country: Spain

It must be stressed that all the Entity Matcher-attributed fields depend on the successful matching of the organisation.
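
The "most frequent location" rule can be illustrated with a short sketch. The variant rows below are invented for the example and do not reproduce the contents of Fig. 9.2; the function name is likewise hypothetical, and how ties are resolved in TIM is not specified here.

```python
from collections import Counter

# Invented excerpt: (name variant, city, country) rows linked to the ename "JRC".
JRC_VARIANTS = [
    ("Joint Research Centre", "Ispra", "Italy"),
    ("JRC", "Ispra", "Italy"),
    ("JRC Seville", "Sevilla", "Spain"),
]


def most_frequent_location(variants):
    """Pick the (city, country) pair that occurs most often among the variants."""
    counts = Counter((city, country) for _, city, country in variants)
    (city, country), _ = counts.most_common(1)[0]
    return {
        "emm_affiliation__ecity": city,
        "emm_affiliation__ecountry": country,
    }
```

For the rows above, the function returns Ispra and Italy, even for a document whose original emm_affiliation__city is Sevilla.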

9.2.1. European Locations

TIM generates specific fields for the study of organisations located in Europe. These fields respond to two needs: visualising all the EU countries together as a single unit, or visualising only the EU countries. For those cases, the fields emm_affiliation__eucountry and emm_affiliation__eocountry are created from the country information, and so are the respective Entity Matcher fields, which use the country information of the Entity Matcher.

emm_affiliation__eucountry is generated by replacing the names of the countries that are members of the European Union with the value EU. All the other values, corresponding to non-EU countries, stay unchanged as in _country. This makes it possible to build analyses in which all the EU organisations appear as one unit and the rest of the countries appear as separate entities.

emm_affiliation__eocountry, on the contrary, is used to analyse exclusively the affiliations of EU countries. In this case, the country name is kept when the country is an EU member, whereas the country name is removed (i.e. replaced by an underscore, “_”) when the country does not belong to the European Union.

Examples

emm_affiliation__country: Germany
emm_affiliation__eucountry: EU
emm_affiliation__eocountry: Germany

emm_affiliation__country: United States
emm_affiliation__eucountry: United States
emm_affiliation__eocountry: _
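
The two derived fields can be expressed as a small function. This is a minimal sketch: the EU_MEMBERS set is a deliberately incomplete, illustrative subset, and the function name eu_fields is an assumption, not part of TIM.

```python
# Illustrative subset only; the member list actually used by TIM is not given here.
EU_MEMBERS = {"Germany", "Italy", "Spain"}


def eu_fields(country: str) -> dict:
    """Derive the eucountry and eocountry values from a country name."""
    is_eu = country in EU_MEMBERS
    return {
        # EU members collapse into the single value "EU"; others stay unchanged.
        "emm_affiliation__eucountry": "EU" if is_eu else country,
        # EU members keep their name; others are replaced by an underscore.
        "emm_affiliation__eocountry": country if is_eu else "_",
    }
```

Calling eu_fields("Germany") and eu_fields("United States") reproduces the two examples above.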

9.3. Merged Values

There are cases where the emm_affiliation__ename will be empty (e.g. because the algorithm couldn’t match the emm_affiliation__name with a known organisation), and thus all the (location-related) associated fields will also be empty. For those cases, a special set of fields starting with emm_affiliation__mrg* exists. These fields are, in general, identical to their emm_affiliation__e* counterparts, except that, where there is no match from the Entity Matcher, the original name and location fields from the document will be used. This is a compromise, because it will not save you from duplicates or mistakes; on the other hand it will still provide data when the Entity Matcher fails to.

So, field emm_affiliation__ename becomes emm_affiliation__mrgename, emm_affiliation__ecity becomes emm_affiliation__mrgecity and so on.

The merged fields are all produced by a transformation, as illustrated in Fig. 9.3. You can read more about the transformation mechanism and how you can use it in Transformations.

_images/mrgename.png

Fig. 9.3 The field emm_affiliation__mrgename provides emm_affiliation__name when emm_affiliation__ename is empty.
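
The fallback behaviour of the merged fields can be sketched as below. This is not the TIM transformation itself, only an illustration of the rule: the attribute list is an assumed subset, the function name is hypothetical, and the field name emm_affiliation__mrgecountry is inferred from the naming pattern rather than stated above.

```python
# Assumed subset of attributes that have both an original and an "e" field.
ATTRIBUTES = ("name", "city", "country")


def add_merged_fields(doc: dict) -> dict:
    """Return a copy of doc with emm_affiliation__mrge* fields added.

    If the Entity Matcher produced an ename, the e* values are used;
    otherwise the original fields of the document are used instead.
    """
    out = dict(doc)
    matched = bool(doc.get("emm_affiliation__ename"))
    for attr in ATTRIBUTES:
        if matched:
            source = f"emm_affiliation__e{attr}"
        else:
            source = f"emm_affiliation__{attr}"
        if source in doc:
            out[f"emm_affiliation__mrge{attr}"] = doc[source]
    return out
```

A matched document therefore gets the Entity Matcher values in its merged fields, while an unmatched one keeps its original name and location, including any duplicates or mistakes they contain.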