Indexing content

Hi everyone!
This post is about indexing. In this first part, we will concentrate on indexing textual information.

Document

According to Wikipedia, a document is a written, drawn, presented, or memorialized representation of thought.
Another definition comes from ISO/TC 46, which defines a document as a combination of information content and a supporting physical carrier.
Indexing is the process of storing an entity in such a way that it is easy to retrieve it or to search for some part of it.

Document = information + support carrier

Information media sources:

sound
still image
text
sequences of images (video)

Support carriers:

paper
digital
web
magnetic media

In this post, we will focus on text.

Indexing text means translating the text into a representation in a specific indexing language. This can be done in three steps:
1 Identifying the index terms
2 Choosing the weights to assign to the index terms, based on the importance given to each term
3 Structuring the indexes
To structure the indexes, we have techniques like:

Inverted files
Hash tables
Tree structures
etc

Identifying index terms

Index terms are elements that describe the content of the text and its structure (chapter, section, …).
Two major kinds of language are used for this task:
1 Controlled language
It is a predefined language, also called a thesaurus, and is used to annotate text manually.
2 Uncontrolled language
In this case, there is no predefined language; index terms are generally chosen according to the content of the text to index. The text may be annotated automatically.
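
To make the difference between the two approaches concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the tiny thesaurus, the stop-word list, and the function names are assumptions, not part of any real indexing tool.

```python
import re

# Hypothetical controlled vocabulary (a tiny "thesaurus") that maps
# words found in the text to a preferred index term.
THESAURUS = {
    "car": "vehicle",
    "truck": "vehicle",
    "automobile": "vehicle",
    "doctor": "physician",
}

# Small illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "by"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def controlled_terms(text):
    """Keep only tokens found in the thesaurus, mapped to preferred terms."""
    return sorted({THESAURUS[t] for t in tokenize(text) if t in THESAURUS})


def uncontrolled_terms(text):
    """Take the index terms from the text itself, minus stop words."""
    return sorted({t for t in tokenize(text) if t not in STOP_WORDS})


text = "The doctor drove a car to the hospital."
print(controlled_terms(text))    # ['physician', 'vehicle']
print(uncontrolled_terms(text))  # ['car', 'doctor', 'drove', 'hospital']
```

With the controlled language, only terms known in advance can appear in the index. With the uncontrolled one, the index vocabulary grows with the text.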

Choosing weights for index terms

The most used techniques are automatic ones, based either on counts or frequencies (TF: term frequency, and IDF: inverse document frequency) or on a vector transformation of the text (Word2vec, GloVe).
The main difference between the two is that the count-based method treats the elements of the text independently, while the vector transformation keeps the relationships between elements (words or sentences, for example).
Other weighting schemes may be independent of the document and depend only on the type of element.
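
As an illustration of the frequency-based approach, here is a small Python sketch of one common TF-IDF variant. The exact formula is an assumption made for the example (TF as the relative frequency of the term in the document, IDF as log(N / df) without smoothing); libraries often use slightly different definitions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs: list of token lists.
    TF  = count of the term in the document / number of tokens in it.
    IDF = log(N / number of documents containing the term).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "indexing text with an inverted file".split(),
    "indexing images and video".split(),
    "hashing text".split(),
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

A term that appears in every document gets a weight of zero, which matches the intuition that it does not help to tell documents apart.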

Structuring indexes

The final task of indexing makes use of:

Inverted files
Hash tables
Tree structures
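
To give an idea of what an inverted file looks like, here is a minimal Python sketch. The function names and the sample documents are assumptions made for the example, and the tokenization is deliberately naive (lowercase and split on whitespace). Note that the Python dict used here is itself a hash table.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids containing it.

    docs: dict of {doc_id: text}.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return the ids of documents containing every term of the query."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


docs = {
    1: "indexing text with an inverted file",
    2: "indexing images and video",
    3: "hashing text",
}
index = build_inverted_index(docs)
print(index["text"])                   # [1, 3]
print(search(index, "indexing text"))  # [1]
```

Instead of scanning every document for a query term, the search only looks up the term and reads its list of document ids.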