Indexing content
Hi every one !
This post is about Indexing and we will in the first part concentrate on indexing textual information.
Document
From Wikepedia, a document is a written, drawn, presented, or memorialized representation of thought.
Another definition if from the ISO-TC 46, which defined a document as a combination of an information content and a
supporting physical carrier.
Indexing is the process of storing an intity in such that is will be easy to retreive or search for some part of it.
Document = information + support carrier
Information media source:
- sound
- still image
- text
- sequences of images ( video )
Support carriers :
- Paper
- digital
- web
- magnetics
In this post will focus on text
Indexing text, is translating the text into a language specific design. This can be done in 03 steps:
1 Identifying the index terms
2 Choosing the weights to assign to the indexed terms base on the importance given to a term
3 Structure the indexes
We have techniques like:
- Inverted files
- Hashing tables
- Tree structures
- etc
Index terms: Identifying index terms
Index terms are elements that describe the content of the text and it structure(chapter, section, …)
Two major languages are use to do this task :
1 Controlled language
It is a predefined language also call thesaurus and is used to manually annotate text
2 Uncontrolled language
In this case, there is no predefined language, index terms are generally chosen according to the contents of the text
to index. The text may be automatically annotated.
Chosen weight for index terms
Most used techniques are automatic ones based on count or frequencies (TF : term frequency and IDF: inverse document
frequency), or on vector transformation of text(Word2vec, Glove).
The main difference between the count and the vector transformation is that the count method transforms elements of the
the text independently while the vector transformation will keep the relationship between element(word ou sentence for
for example).
Others may be independent of the document and a function of the element types.
Structuring indexes
Final task of indexing, make use of :
- inverted files
- Hashing tables
- Tree structures