TimDocs 0.3 documentation

2. Search

«  1. Getting Started   ::   Contents   ::   3. User Interface  »

2. Search¶

Contents

  • Search

    • Create a dataset

    • How to build a search query

      • Fields

      • Terms

        • Stemming

        • Types of terms

        • Term modifiers

          • Wildcard Searches

          • Proximity Searches

          • Range Searches

      • Operators

        • Grouping

      • Search Query Examples

    • Validate Dataset

      • Assess the quality of the results of the Search - Recall and Precision

    • Advanced

      • Alternative query syntax

    • Common Pitfalls

      • Not all quotes are created equal!

      • Over-use of the asterisk *

      • Using asterisks with values containing spaces

      • Stemming and wildcard searches

2.1. Create a dataset¶

A dataset is a collection of documents that can be analysed and visualised together. Datasets are normally defined using a structured search string that retrieves the matched documents from the TIM database.

To create a new dataset, click on CREATE Dataset in the navigation panel (3).

datasets

The Create Dataset page is loaded. Type a name in Dataset name and enter the search string in the field Query (see the next section on how to build a search string). The option Model should be left with the default value (normal). The option Type of Search refers to the group of documents you want to search in. The current options available are:

  • Search on Solr: searches on publications from Semantic Scholar, patents from patstat and EU projects from CORDIS.

  • Search on COVID-19 Open Research Dataset (CORD-19): searches on publications on COVID-19 and coronavirus-related documents.

The Model option offers some dataset processing options to the power user, among which is the ability to lift the 10,000 document limitation. As this is generally not encouraged for reasons of both performance and readability of the graphs, these options are hidden. If you have a specific need for that kind of functionality, you can contact us at JRC-TIM-SUPPORT@ec.europa.eu.

createDataset

2.2. How to build a search query¶

The aim of this section is to provide some guidelines for TIM users when designing a search query. The search functionality is based on queries using a specific syntax. The query will retrieve documents from the TIM database of scientific publications, patents and EU granted projects, which will constitute the dataset. The dataset can then be visualised through the various pages available.

A search query is composed of a combination of fields, terms and operators.

Note

  • Terms refer to what you want to search for

  • Fields refer to where you want to search for it

  • Operators help you combine more than one field or more than one term, to construct more elaborate search queries.

2.2.1. Fields¶

Data indexed in TIM is structured and organised in fields, such as title, abstract, location, year. For this reason, when forming a query, the fields to be searched need to be specified. Depending on the group of documents selected earlier, the fields available differ.

The most important fields that are searchable for CORD-19 are the following:

operator name

Searches in

title

field title of documents.

description

field description or abstract of a document. containing the searched terms in the title or in the description.

text

full text of the document (when available).

emm_year

publication year of the article

emm_journal

journal of the publication

emm_autoKeywords

Automatic keywords generated by TIM

emm_doi

DOI of the article

guid

unique TIM ID of the article

emm_cord_uid

unique CORD-19 ID

emm_affiliation__city

City of the organisation publishing

emm_affiliation__country

Country of the organisation publishing

emm_affiliation__countryCode

Country code of the organisation publishing

emm_affiliation__name

Name of the organisation publishing

emm_author__name

authors’ names

The most important fields that are searchable for Semantic Scholar are the following:

operator name

Searches in

Refers to the

ti

field title of documents.

title of the publication, patent or EU granted project.

ti_abs

fields title and abstract of documents.
It will retrieve documents containing the searched terms in the title or in the description.

abstract of the publication, patent or EU granted project.

topic

Used to search in the fields title, abstract, author keywords and automatic keywords of documents. It will retrieve documents containing the searched terms in the same fields as title_abs_key and in the automatic keywords.

Automatic keywords are terms generated automatically by natural language processing algorithms. Automatic keywords are generated for all types of documents.

au

Used to search in the field author or inventor of documents.

Author refers to the name of the author in a publication or the name of the inventor of a patent.

org

Used to search for the organisation publishing an article, applying for a patent (applicant) or participant of a grant.

The information about organisation is retrieved from: affiliation of authors, applicants of patents, participants in EU project.

Users can specify a field, by typing the field name followed by a colon : and then the term or phrase of interest within that field.

For example, to find a document that contains the term electrolyte in the title, the query should be constructed as follows:

ti:electrolyte

The field is only valid for the term that it directly precedes, so the query

ti:functional dynamics

will only find functional in the title field and will ignore the term following it. If you want to search for the phrase functional dynamics, the query should be within double quotes.

ti:"functional dynamics"

This is called an exact search.

Exact searches can also be made in a few additional fields:

class

Usage

emm_year

Use the year number or for periods use [year TO year]

emm_affiliation__ecountry

emm_affiliation__ecity

emm_affiliation_nuts

The exact English version of names of countries and cities can be searched. For NUTS (Nomenclature of Territorial Units for Statistics), you can refer to the full list at Eurostat[1].

emm_classificationCPC

Each patent belongs to one or more CPC classification. For the nomenclature of patent classes see Espacenet.[3]

Exact search means that you have to put the exact text (case sensitive) if you want to find a match.

So, if for example you search for emm_affiliation__ecountry:United, this will not match United Kingdom, or any other country starting with United and followed by more characters.

Also, emm_year:99 will not work. You can, however, write emm_year:*99, or emm_year:19*, as single term modifiers will work as usual.

To filter by document type in the search string, use the following:

Documents Class | To be used in the search string

Patent

class:patent

EU project

class:euproject

Note

Not all the available fields are described in this document. Please contact us for more complex searches.

2.2.2. Terms¶

2.2.2.1. Stemming¶

Before talking about terms in general, it is important to mention stemming. Stemming is a process that reduces words to their word stem, or root. For example:

technologies becomes technolog-

This means that if you search for the word technologies or technology, TIM will understand you are looking for documents containing a word whose stem is technolog and hence will match:

technology, technological, technologically.

TIM applies stemming to the search terms inputted by the users, in order to widen the set of matched documents. It also simplifies the search query, as the user doesn’t have to input all possible variants of some words.

That said, there are some cases that should be handled with care. If, for example, the user makes a search:

topic:"operating system"

then the stem of the word operating is operat, which means that this will also match operative system and operational systems, which might not be what the user had in mind. In this case, the user should try to narrow down the search by contextualising it to the area of interest, so as to avoid matching the other variants.

This also has implications on how Term modifiers are used.

TIM uses the de facto standard algorithm for stemming (Porter’s), and there exist a few stemming testers freely on the internet, on which the user, if not certain, can test how a word gets stemmed. For an example webpage, see this.

It is advisable to consult with stemmers because sometimes results can be surprising. For example, in the case of printers and printing, it turns out that these two words don’t have the same stem!

  • Printers’s stem is printer-

  • Printing’s stem is print-

See Stemming and wildcard searches on how this can impact a search query.

2.2.2.2. Types of terms¶

There are two types of terms: single terms and phrases

  • A single term is a single word, such as emission or nanoparticle.

    Again, it’s important to understand that what is searched for is words whose stem is emiss- and nanoparticl-, respectively.

  • A phrase is two or more words surrounded by quotation marks, such as

    "rapid prototyping method" (="rapid prototyp- method")

When a phrase is used in a search query, documents matching the exact phrase will be retrieved.

A search for the phrase "rapid prototyping" will not retrieve the documents containing the phrase rapid and cheap prototyping, but will retrieve the ones containing solid-based rapid prototyping.

Note

Some of the fields in TIM store text in a different way. For example, when the text in a field is actually representing identifiers or codes, where spaces might be significant, the text is stored as a continuous string. More information is covered in chapter Fields in TIM.

2.2.2.3. Term modifiers¶

Term modifiers allow the User to introduce flexibility or precision in the terms used for a search. They include wildcard characters, characters for making a search “fuzzy” or more general, and so on. The sections below describe these modifiers in detail.

2.2.2.3.1. Wildcard Searches¶

TIM search engine supports single and multiple character wildcard searches within single terms. Wildcard characters can be applied to single terms, but not to search phrases.

?

Single character (matches a single character).

*

Multiple characters (matches zero or more sequential characters)

For example, the search string:

ti:te?t

will match both text and test in the title of the document.

The search string:

ti:electroly*

will match electrolyt-e, electrolyt-ic, electrolys-is, electrolys-er and electrolyz-er and others in the title of the document. In reality, it will be matching stems electrolyt- and electrolys-.

You can also use wildcard characters in the middle of a term. For example:

ti:leuk*mia

would match leukaemia and leukemia in the title of the document.

Last, you can use wildchard characters in the beginning of a term. For example:

ti:*leukaemia

would match, besides leukaemia, erythroleukaemia.

2.2.2.3.2. Proximity Searches¶

A proximity search looks for terms that are within a specific distance from one another.

To perform a proximity search, add the tilde character ~ and a numeric value to the end of a search phrase, and enclose the whole block in parentheses. The number signifies the distance that two words can be apart, or the number of word movements that need to be performed to get from the matched string to the phrase we have provided. The latter concept is a bit more complicated, but works better when using proximity search on phrases with more than two words.

Example

To search for nuclear and waste within 5 words of each other in the topic of a document, use the search:

topic:("nuclear waste"~5)

In this example , the search would match a document containing the following text:

…nuclear and biological waste….

and also a document containing the text “….in nuclear plants, radioactive waste is treated……”.

Example

If the user types:

topic:("low-dose radiation"~4) (note the use of parenthesis in both cases)

then this is one of the results: “…induced by low dose irradiations; its impact on radiation protection”.

The distance referred to here is the number of term movements needed to go from the specified phrase, low-dose radiation to the matched phrase. It is convenient to visualise this as consecutive word transpositions/insertions of the unrelated terms (irradiations, its, impact, on) outwards, until they are out of the specified phrase (or in, depending on your starting point).

The same result would have been retrieved by using:

topic:("low irradiations impact"~4)

because the same number of words needs to be inserted. If, however, you were to give the query with e.g. two terms reversed:

topic:("irradiations low impact"~4)

then this would no longer work with ~4, you would have to use ~5, because an extra transposition is required.

2.2.2.3.3. Range Searches¶

A range search specifies a range of values for a field (a range with an upper bound and a lower bound). The query matches documents whose values for the specified field or fields fall within the range. Square brackets [ ] denote an inclusive range query that matches values including the upper and lower bound.

For example, one could search in the field relating to the year of publication using:

emm_year: [2002 TO 2004]

2.2.3. Operators¶

Boolean operators allow you to apply Boolean logic to queries, requiring the presence or absence of specific terms or conditions in fields in order to match documents.

The table below summarizes the Boolean operators supported by TIM.

AND

All search terms must occur to be retrieved

The search string:

(nanomaterial AND toxicity)

will retrieve documents that contain both terms.

OR

Any one of the search terms must occur to be retrieved. Use when searching variants and synonyms.

The search string:

(nanomaterial OR nanotube OR nanoparticle)

will retrieve documents that contain at least one of the terms.

NOT

Excludes records that contain a given search term

The search string:

(nanomaterial NOT inflammation)

will retrieve documents that contain nanomaterial excluding any which also contain inflammation.

In order to be recognised as such by the software, the Boolean operators should always be used in capital letters.

2.2.3.1. Grouping¶

a) Parentheses are used to group terms into sub-queries, or to group sub-queries into larger queries. This allows you to control the Boolean logic for a query.

If, for example, you have variants for the concept of nanomaterial, you can combine them with multiple OR operators, and enclose everything in parentheses:

topic:(nanomaterial OR nanotube OR nanoparticle)

Do not ommit the parentheses!

topic:nanomaterial OR nanotube OR nanoparticle

would only yield results for topic:nanomaterial and discard the rest of the query.

You can further craft your search query, for example requiring that the term toxicity exists, along with either of the nanomaterial terms you are using.

topic:((nanomaterial OR nanotube OR nanoparticle) AND toxicity)

  1. Parentheses have to be used always when combining terms in one search field, but are not strictly necessary when combining sub-queries of different fields.

    topic:(nanotube AND toxicity) AND location:Switzerland

Here you have to enclose nanotube AND toxicity in parentheses, but you don’t have to also enclose the whole line in parentheses. However, if you add more Boolean operators, confusion will be unavoidable if you don’t enclose your sub-queries in parentheses.

topic:A AND location:B OR class:C

is different to

topic:A AND (location:B OR class:C)

so you should make sure to be as explicit as possible, by always using parentheses.

  1. If two terms are written consecutively inside parenthesis, without using quotes, then the AND operator is implied.

For example, if you write:

topic:(nanomaterial toxicity)

it will be treated as if you had written

topic:(nanomaterial AND toxicity)

  1. It should also be noted that, for fields combining more than one field, such as topic and ti_abs_key, TIM will try to match all the terms given, in one field after the other.

So, for example, if you search

topic:(nuclear AND risk)

TIM will first search in the title for both nuclear and risk. If it fails, it will then proceed to search in the description, if it fails it will move on to the author keywords, and so on and so forth. This is illustrated in the article below, where the above search fails (i.e. the document below will not be retrieved), as the two words are never found together in the same field.

publicationEg

If you want to match for both the words anywhere in the topic, i.e. you don’t care if nuclear is in the title and risk is in the keywords, then you have to explicitly state it, by searching:

topic:nuclear AND topic:risk

2.2.4. Search Query Examples¶

topic:((nanoparticle OR nanogranule) AND medicine)

key:``synthetic biology``

ti_abs_key:soil AND org: ``joint research centre``

ti:(3D AND print) NOT emm_journalCategoryScopus: ``Biophysics OR Biochemistry``

key:(GNSS OR ``global navigation satellite system)``

topic:( ``low-dose radiation``~5 AND protect*) AND emm_year:[2002 TO 2004]

2.3. Validate Dataset¶

The goal of the search functionality in TIM Technology is to define a dataset to be visualised and interpreted.

A dataset is a set of patents, publications and projects that are specific to a scientific domain, field or technology that has been defined by the user. The quality of the dataset is of utmost importance, as it will constitute the basis for all the visualisation(s). When the user submits a search query, a screen for the validation of the dataset is displayed.

validateDatasets3

The first twenty results of the search performed by TIM will be displayed to assess the relevance of the search strategy used. Once created, the dataset can be saved by clicking on the Submit button.

If the preview of the results seems to indicate that the results are not relevant, the search string needs to be refined or modified. To do so, you can use the back button.

TIM Technology limits the size of datasets to 10.000 documents. If a search results hits that maximum, it is recommended to refine the search in order to have a smaller, more meaningful dataset that is easier to visualise and interpret.

2.3.1. Assess the quality of the results of the Search - Recall and Precision¶

Bibliometricians have been using the concepts of precision and recall to measure the quality of an information retrieval. Precision is the ratio of the number of relevant records retrieved to the total number of records retrieved (irrelevant and relevant). Recall helps to measure the proportion of relevant documents out of all the documents retrieved. It is calculated as the ratio of the number of relevant records retrieved to the total number of relevant records in the database. It is usually expressed as a percentage. Recall and precision are inversely related. It is important to keep these concepts in mind when designing a search and to balance them according to the goal of the search strategy.

2.4. Advanced¶

2.4.1. Alternative query syntax¶

There is an alternate query syntax that can be used in cases where the query becomes too long for TIM to parse. An often-seen case is when there is a list of a thousand documents, refered to by their IDs, connected by OR clauses. The format is:

_query_:"{!terms f=<FIELD>}<TERMS>"

where <FIELD> is a search field, such as emm_doi and <TERMS> is a list of the terms that you want to join together by OR, in this case, the actual DOIs. For example,

_query_:"{!terms f=emm_doi}10.1024/0939-5911/a000458,10.1007/s00038-016-0830-z,10.1093/pubmed/fdw040,10.1016/j.socscimed.2017.04.031"

This is equivalent to:

emm_doi:(10.1024/0939-5911/a000458,10 OR 1007/s00038-016-0830-z,10 OR 1093/pubmed/fdw040,10.1016/j.socscimed.2017.04.031)

Caution

this will only work for specific fields, such as ID, year, country. For a full list, see here.

2.5. Common Pitfalls¶

2.5.1. Not all quotes are created equal!¶

topic:
(
(4dprinting OR “4d printing”~2 OR “4-dimensional printing”~2)
OR
(
(4dprinting OR “4d printing”~2 OR “4-dimensional printing”~2)
AND
(“Programmable Materials” OR “Programmable matter” OR “Self*assembly” OR “Self*fold*”)
)
)

Why does this search term produce 34400 results ? For the same reason the following gives 34280!

topic:(“4d printing”~2 OR "4-dimensional printing"~2)

Do you notice something strange about this query? the quotes used for the first term are different than the quotes used for the second term!

“4d printing” and "4d printing" is not the same thing!!

The short story is that you definitely want to avoid using “ ”. This set of quotes usually comes from copying and pasting from and to microsoft office applications. Make sure to use just notepad or wordpad when copy-pasting, so that the text is preserved as is.

The long story:

The “ ” are not regarded as quotes, they are regarded as normal characters, so what the system sees is:

  • first term: “4d [a word consisting of a weird character, followed by 4d]

  • second term: printing”~2 [the word printing followed by a weird character, and a modifier which in THIS case means that you want a word that looks like printing”, more or less!]

  • third term: "4-dimensional printing"~2 [the exact term “4-dimensional printing”, with the possibility that one or two words are put somewhere in between, or the order is changed]

So what you were giving the system was A B OR C, which is going to be interpreted as A AND B OR C, and who can remember whether that is (A AND B) OR C, or A AND (B OR C) …

2.5.2. Over-use of the asterisk *¶

Take, for example, the term "Self*fold*". You can write "self-fold" instead, without any asterisks (but always in quotes)

The first asterisk can be omitted, because if you look for "self-fold", the system will also automatically look for "self fold".

The second asterisk can be omitted because stemming does that for you automatically: when you look for "fold", "folded" and "folding" will also be matched.

However, keep in mind that, for terms separated with a dash that can also be found as single words, you will have to provide the single word variant yourself. For example, cross-sectional can also be found in documents as crosssectional, so you have to make a query like so:

title:(cross-section OR crosssection)

2.5.3. Using asterisks with values containing spaces¶

Some special field values may contain spaces that are significant, e.g. emm_classificationCPC’s values usually have one or more spaces.

Searching for emm_classificationCPC:"G09G   3/20912" works as expected (that’s three spaces in there!). It becomes a bit more complicated when the User wants to search e.g. for all classification codes that start with G09 G   3/20 and end in any number. Adding an asterisk, emm_classificationCPC:G09G   3/209* will not work, as the parser thinks this is emm_classificationCPC:G09G AND <something else>. This is because what is after the space is not taken into account and the parser expects an operator at that position.

In these cases normally the value searched would be put within quotes, but asterisks don’t work with quotes.

So searching for emm_classificationCPC:"G09G   3/209"* won’t work either (please note that if the original search was without asterisk, the quotes would have worked as expected).

For this specific case to work, each space must be escaped:

emm_classificationCPC:G09G\ \ \ 3/209*

Note that each space has to be escaped separately. If you are not sure about the number of spaces, you could also attempt:

emm_classificationCPC:G09G\ *3/209*

2.5.4. Stemming and wildcard searches¶

Beware of cases such as printers and printing, as mentioned in the section above (Stemming):

  • title:(3D AND print*) will return documents that include 3D AND (printing or printers or printer etc) in the title.

  • title:(3D AND print) (without the asterisk) will return documents that include 3D AND printing but not printers.

«  1. Getting Started   ::   Contents   ::   3. User Interface  »

© Copyright 2018-2021, Geraldine Joanny, Sotiri Fragkiskos. Created using Sphinx 4.0.2.