Determining the Relevance of Documents Using Natural Language Processing

Reddi1 · Post by **Reddi1** » Thu Jan 30, 2025 6:46 am

Google already uses NLP in many areas of search, as the introduction of the BERT update and Google's Natural Language Processing API shows. The BERT update concerns the interpretation of search queries, so I won't go into it in more detail here.

In the end, it doesn't matter whether you apply NLP to a search term, a full text, or a text fragment such as a paragraph, sentence or sequence of words. The process is the same and consists of the following steps.

Tokenization: Tokenization is the process of dividing a sentence or text fragment into individual terms (tokens).
Part-of-speech tagging: Part-of-speech tagging classifies words by grammatical category, such as noun, verb, adjective or adverb.
Word dependencies: Dependency parsing creates relationships between words based on grammar rules. This process also maps “jumps” between words, that is, links between words that do not stand next to each other.

Example of Part of Speech Tagging and Dependency Parsing, Source: Explosion.ai Demo
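
To make the first step a bit more concrete, here is a minimal tokenization sketch using only Python's standard library. The regular expression and the name `tokenize` are my own assumptions for illustration. Production tokenizers (and whatever Google uses internally) handle far more edge cases, and part-of-speech tagging and dependency parsing need a trained model, so they are not shown here.

```python
import re

# Hypothetical pattern (an assumption for this sketch): a run of word
# characters, optionally joined by apostrophes or hyphens, or any single
# punctuation character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split a text fragment into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Google's BERT update interprets search queries.")
print(tokens)
# ["Google's", 'BERT', 'update', 'interprets', 'search', 'queries', '.']
```

The same function works for a search term, a sentence or a whole paragraph, which is the point made above: the process does not depend on the length of the input.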

Lemmatization: Lemmatization determines whether a word has different forms and normalizes these variations to the base form (the lemma). For example, the base form of "animals" is "animal", and the base form of "played" is "play".
Parsing labels: A label classifies the type of relationship between two words that are connected by a dependency, for example subject or object.
Analysis and extraction of named entities: This aspect should be familiar from the previous posts. The step attempts to identify words with a "known" meaning and to assign them to classes of entity types. In general, named entities are people, places and things (nouns). Entities can also include product names. These are generally the words that trigger a knowledge panel. But terms that do not trigger their own knowledge panel can also be entities. More on this in the article What is an entity? What are entities?

Example of an entity analysis using Google's Natural Language Processing API.
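
The lemmatization step from the list above can be sketched with a toy lookup table plus a few suffix rules. Everything here (`IRREGULAR`, `SUFFIX_RULES`, `lemmatize`) is hypothetical and deliberately incomplete. Real lemmatizers rely on large dictionaries and part-of-speech information, so treat this only as an illustration of the idea of normalizing word forms to a base form.

```python
# Hypothetical lookup table for irregular forms (an assumption for this
# sketch); real lemmatizers use full dictionaries.
IRREGULAR = {"mice": "mouse", "children": "child", "went": "go"}

# (suffix, replacement) pairs, tried in order. Deliberately incomplete.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]


def lemmatize(word: str) -> str:
    """Return a rough base form of a single English word."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    for suffix, replacement in SUFFIX_RULES:
        # Require a stem of at least three letters so that short words
        # such as "is" or "red" are left unchanged.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: len(w) - len(suffix)] + replacement
    return w


print([lemmatize(w) for w in ["animals", "played", "queries", "mice"]])
# ['animal', 'play', 'query', 'mouse']
```

Rules this simple fail on many words (doubled consonants, for example), which is exactly why real systems combine a dictionary with the part of speech of each token.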

Salience scoring: Salience describes how intensively a text deals with a topic. In NLP it is determined on the basis of so-called indicator words. In general, salience is derived from how words are cited on the web and from the relationships between entities in databases such as Wikipedia and Freebase. Google probably also applies this link graph to entity extraction in documents in order to determine these word relationships. Experienced SEOs are familiar with a similar approach from TF-IDF analysis.
Sentiment analysis: In short, this is an evaluation of the opinion (view or attitude) expressed in an article about the entities discussed in the text.
Subject Categorization: At a macro level, NLP classifies text into subject categories. Categorizing topics helps determine what the text is about in general terms.
Text Classification & Function: NLP can go even further and determine the intended function or purpose of the content.
Content type extraction: Google can use structural patterns or context to determine the content type of a given text without using structured data. The HTML, the formatting of the text, and the data type of the text (date, location, URL, etc.) can be used to understand the text without additional markup. This process allows Google to determine whether a text is an event, recipe, product, or another content type without the need for markup.
Identifying implicit meaning based on structure: The formatting of a body of text can change its implicit meaning. Headings, line breaks, lists, and proximity provide a secondary understanding of the text. For example, if text appears in an HTML ordered list or in a series of headings with numbers in front of them, it is likely to be a process or a ranking. Structure is defined not only by HTML tags, but also by the visual font size/weight and proximity when rendered.
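
The last point can be illustrated with a small heuristic: count the items in ordered lists and the headings that start with a number, and treat the text as a likely process or ranking if either appears at least twice. This is only a sketch under my own assumptions. The class `StructureHints`, the function `looks_like_sequence` and the threshold of two are hypothetical, it says nothing about how Google actually does this, and it ignores rendered font size and proximity entirely.

```python
import re
from html.parser import HTMLParser

HEADING_TAGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class StructureHints(HTMLParser):
    """Collect coarse structural signals from an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.ordered_items = 0  # <li> elements directly inside an <ol>
        self.headings = []      # text content of each heading
        self._list_stack = []   # open list tags, innermost last
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in ("ol", "ul"):
            self._list_stack.append(tag)
        elif tag == "li" and self._list_stack and self._list_stack[-1] == "ol":
            self.ordered_items += 1
        elif tag in HEADING_TAGS:
            self._in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag in ("ol", "ul") and self._list_stack:
            self._list_stack.pop()
        elif tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self.headings[-1] += data


def looks_like_sequence(html: str) -> bool:
    """Heuristic: does the markup suggest a process or a ranking?"""
    parser = StructureHints()
    parser.feed(html)
    parser.close()
    # Headings such as "1. Preheat" or "2) Bake" count as numbered.
    numbered = [h for h in parser.headings if re.match(r"\s*\d+[.)]", h)]
    return parser.ordered_items >= 2 or len(numbered) >= 2


print(looks_like_sequence("<ol><li>Mix</li><li>Bake</li></ol>"))  # True
print(looks_like_sequence("<ul><li>Flour</li><li>Eggs</li></ul>"))  # False
```

An unordered list is deliberately not counted, because the order of its items carries no implied sequence.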