Understand the public search engine
Discover indexation and weighting of content in SPIP
Lessons you should have completed:
Understand articles and news items
Use the Search Engine
Reference sites and work with syndication

What will I learn to do?

The SPIP public search engine is very easy to use and yet it is quite powerful.

Most users do not need to know the ins and outs of how the search engine works, but webmasters and administrators may want to understand a little more about it.

CAUTION

This lesson is only about the public search engine and not the private one.

Unlike the public search engine, the private search engine does not rely on indexation, and its results do not take any weight calculation into account.
The private search engine is less likely to be used intensively than the public one, so the gains in performance and accuracy that indexation provides are not crucial.

In this lesson you will learn:

- What indexation is, when and how content is indexed in SPIP

- What weighting is

- How indexation and weighting condition the search results

- How to extend the search to sites you have referenced

What is indexation?

In a traditional back-of-the-book index, the words (or phrases) are concepts selected by a person and the pointers are page numbers. Indexes are designed to help readers find, quickly and easily, the pages that contain the information they are looking for.

The SPIP index has exactly the same purpose.

It allows the search engine to find the reference of an article or news item for a given word without having to read the entire database.
But unlike the index of a book, the SPIP index is constantly evolving because the content keeps changing.

AUDIENCE VIEW

Search results page for a search on the word women

Suppose that someone searches for the word women, and that the index lists this word under articles 1 and 2.

- Without indexation, all the articles would have to be read in order to find the word women; on a large site this could take a long time even for a computer.

- With an index, all we have to do is look for the entry women in the index: we can immediately see that the word is present in articles 1 and 2.
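
The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration: the article texts, the `index` dictionary, and both function names are invented for the example and do not reflect SPIP's actual code or database layout.

```python
# Hypothetical article texts, keyed by article number (not real SPIP data).
articles = {
    1: "Women entrepreneurs are opening new businesses",
    2: "A report on women in local government",
    3: "The budget was approved last week",
}

# Hypothetical index: each word points to the articles that contain it.
index = {"women": [1, 2], "budget": [3]}


def search_without_index(word):
    """Read every article in full to find the word."""
    return [number for number, text in articles.items()
            if word in text.lower().split()]


def search_with_index(word):
    """Look up a single index entry; no article text is read."""
    return index.get(word, [])


print(search_without_index("women"))  # [1, 2]
print(search_with_index("women"))     # [1, 2]
```

Both functions return the same answer, but the first one does work proportional to the total amount of text on the site, while the second does a single lookup.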

How is the SPIP content indexed?

CAUTION

The indexed content includes all parts of articles and news items, as well as the names and descriptions of sections, the keywords, the syndicated content already available on your SPIP site, etc.

However, SPIP only indexes words that are more than three letters long.

The principle of indexation is the following: in every text of the site, each word is extracted and entered into a database together with its location.

Article 1
...Police and fire inspectors are creating a risk to public safety – sometimes resulting in death of innocent people - by taking bribes from individuals and businesses who violate the law, a national investigation has found...

Article 2
...Companies are willing to hire illegal Baharistani workers to get them cheaply, and the government inspectors can’t catch them all, an official says...

Article 3
...But government officials say their hands are tied by the limited budget that they are struggling to cope with...

Article 4
...Police and fire fighter union officials would not comment until they had seen the report…

Let us extract the words out of each article and record for each word which article it belongs to:

- Safety: article 1
- Police: articles 1, 4
- Inspectors: articles 1, 2
- Government: articles 2, 3
- Official: articles 2, 3, 4

...and so on, bearing in mind that our site is likely to be much bigger and articles much longer than this.
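
The extraction step described above can be sketched as follows. This is a minimal illustration, not SPIP's implementation: the `build_index` function, its `min_length` parameter, and the shortened excerpts are assumptions made for the example. The sketch also treats "official" and "officials" as two distinct entries, whereas the list above groups them.

```python
import re
from collections import defaultdict

# Shortened excerpts of the four sample articles (illustrative only).
articles = {
    1: "Police and fire inspectors are creating a risk to public safety",
    2: "Companies hire illegal workers, and the government inspectors "
       "can't catch them all, an official says",
    3: "But government officials say their hands are tied by the limited budget",
    4: "Police and fire fighter union officials would not comment",
}


def build_index(texts, min_length=4):
    """Map each word of at least min_length letters to the articles containing it."""
    index = defaultdict(set)
    for number, text in texts.items():
        # Split the lowercased text into runs of letters.
        for word in re.findall(r"[a-z]+", text.lower()):
            if len(word) >= min_length:  # keep words of more than three letters
                index[word].add(number)
    return {word: sorted(numbers) for word, numbers in index.items()}


index = build_index(articles)
print(index["police"])      # [1, 4]
print(index["inspectors"])  # [1, 2]
print(index["government"])  # [2, 3]
print("and" in index)       # False: three letters or fewer are skipped
```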

When is the SPIP content indexed?

Indexation is done at three different times:

- When you publish an article. It is then immediately indexed.

- When you modify an article which is already published. It is then indexed again.

- Each time someone visits the public site and accesses an item which is not indexed (for instance, if the Administrator has just deleted the indexation data or restored a database backup, since indexes are not saved). It is then indexed as a background task.
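
The three triggers listed above can be modelled with a short sketch. All names here (`publish`, `modify`, `visit`, and the two dictionaries) are hypothetical and chosen for the example; in particular, the sketch indexes a visited item on the spot, whereas SPIP does this as a background task.

```python
published = {}  # article number -> text (hypothetical content store)
indexed = {}    # article number -> set of indexed words (hypothetical index)


def index_article(number):
    """(Re)build the index entry for one article, keeping words over three letters."""
    indexed[number] = {w for w in published[number].lower().split() if len(w) > 3}


def publish(number, text):
    """Trigger 1: publishing indexes the article immediately."""
    published[number] = text
    index_article(number)


def modify(number, text):
    """Trigger 2: editing a published article indexes it again."""
    published[number] = text
    index_article(number)


def visit(number):
    """Trigger 3: a visit to an item that is not indexed causes it to be indexed."""
    if number not in indexed:
        index_article(number)
    return published[number]


publish(1, "Police and fire inspectors")
modify(1, "Government inspectors")
indexed.clear()             # simulates deleting the indexation data
visit(1)                    # the visit rebuilds the missing entry
print(sorted(indexed[1]))   # ['government', 'inspectors']
```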

CAUTION

Note that the indexation process is relatively resource-heavy: it requires numerous calculations (they are not very complex, but they are carried out on every word contained in the article) and triggers numerous calls to the database.

If the web host is very slow, it may be preferable to deactivate the search engine.

Note also that if you activate the search engine after articles have already been published, those articles are not indexed immediately: visits to the public site trigger their indexation.
On a large site this can take a while.