Information Searches

Introduction

With the development of information technology, more and more data is being stored in electronic and other forms. Finding the correct data, especially in electronically stored information, is becoming more important by the day. Information research aims to develop models and algorithms for retrieving information from document repositories. Effective information retrieval requires an understanding of how search engines work. Information Retrieval (IR) is defined as the science of searching for information in documents, searching for documents themselves, searching for metadata which describes documents, or searching within databases, whether relational stand-alone databases or hypertextually networked databases such as the World Wide Web (Wikipedia).

Process of Retrieval

Information retrieval is the retrieval of unstructured data.
It could be retrieval of documents or of specific information within documents. It could also be retrieval of speech or images. When a user needs some information, they express that need as a query, a formal statement, and the information retrieval system finds the relevant information. Most information retrieval is done on text. Query formation is based on a ‘bag of words’: the phrase or group of words in the query, with each word treated individually.
As the number of text documents keeps growing, ‘bag of words’ queries lose precision in their results. Synonymous words are one challenge. Another is that a group of words often has a meaning entirely different from that of its individual words. For example, ‘hot dog’ as a group of words has no similarity to ‘hot’ or ‘dog’.

Ranked retrieval

Ranked retrieval starts with a query and calculates a relevance score between the query and every document. It sorts the documents by score and presents the top-scoring documents to the user.
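The score-then-sort loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `tokenize`, `relevance_score` and `ranked_retrieval`, the sample documents, and the placeholder score (the number of distinct query words found in a document) are all invented here and are not taken from any particular search engine.

```python
def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def relevance_score(query_terms, doc_terms):
    """Placeholder score (an assumption): count of distinct query terms present in the document."""
    return len(set(query_terms) & set(doc_terms))


def ranked_retrieval(query, documents, top_k=3):
    """Score every document against the query and return the top_k (score, doc_id) pairs."""
    query_terms = tokenize(query)
    scored = [
        (relevance_score(query_terms, tokenize(text)), doc_id)
        for doc_id, text in documents.items()
    ]
    # Highest score first; ties are broken by document id so the order is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]


# Hypothetical three-document collection used only for illustration.
documents = {
    "d1": "rain forests of the amazon basin",
    "d2": "hot dog stands in the city",
    "d3": "amazonian rain forests and rivers",
}

results = ranked_retrieval("amazonian rain forests", documents, top_k=2)
for score, doc_id in results:
    print(doc_id, score)
```

Any other scoring function could be substituted for `relevance_score` without changing the ranking loop itself.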
Score computing is done in three stages:

• Quorum scoring
• Term frequency (TF) weights
• Inverse document frequency (IDF) weights

For example, suppose the original query is “The Amazonian rain forests”. Case normalization turns it into “the amazonian rain forests”. Stop word removal, in which function words are removed, leaves “amazonian rain forests”. Suffix removal, also known as stemming, reduces it to “amazon rain forest”.

Quorum scoring ranks documents by how many of the query terms they contain, favoring those with the largest number. Term frequency means that the more often a term is used in a document, the more likely the document is about that term.
This also depends on document length. The formula for term frequency is:

$$\mathrm{TF} = \frac{\log(t + 1)}{\log(dl)}$$

where:

• t: number of times the term occurs in the document
• dl: length of the document

Inverse document frequency

This means that a term occurring in many documents is a poor discriminator, while a term occurring in few documents is a good one. This is based on the research finding that more frequently occurring terms are more general and less frequently occurring terms are more specific.
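This measure gives a high value to terms that occur in fewer documents. From this we arrive at a formula, the log of the reciprocal of the fraction of documents containing the term:

$$\mathrm{IDF} = \log\left(\frac{N}{n}\right)$$

where:

• N: total number of documents in the collection
• n: number of documents the term occurs in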
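To make the stages above concrete, the following Python sketch normalizes the example query and then computes quorum, TF and IDF values over a small made-up collection. Several things here are assumptions rather than part of the text: the stop word list, the two-entry suffix list (a crude stand-in for a real stemming algorithm), the use of the natural logarithm, the choice to return 0 where a formula would be undefined, and all function and variable names.

```python
import math

# Illustrative assumptions: a tiny stop word list and a crude suffix list
# standing in for a real stemming algorithm.
STOP_WORDS = {"the", "a", "an", "of", "in", "and"}
SUFFIXES = ("ian", "s")


def normalize(text):
    """Apply case normalization, stop word removal and toy suffix removal."""
    terms = []
    for word in text.lower().split():          # case normalization
        if word in STOP_WORDS:                 # stop word removal
            continue
        for suffix in SUFFIXES:                # suffix removal (toy stemming)
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        terms.append(word)
    return terms


def quorum_score(query_terms, doc_terms):
    """Number of distinct query terms that appear in the document."""
    return len(set(query_terms) & set(doc_terms))


def term_frequency(t, dl):
    """TF = log(t + 1) / log(dl); returns 0.0 when dl <= 1 (an assumed guard, log(1) is 0)."""
    if t == 0 or dl <= 1:
        return 0.0
    return math.log(t + 1) / math.log(dl)


def inverse_document_frequency(total_docs, docs_with_term):
    """IDF = log(N / n); returns 0.0 for a term that occurs nowhere (an assumed guard)."""
    if docs_with_term == 0:
        return 0.0
    return math.log(total_docs / docs_with_term)


# Hypothetical collection used only for illustration.
documents = [
    "The rain forests of the Amazon basin",
    "Hot dog stands in the rain",
    "Amazonian rivers and forests",
]

query_terms = normalize("The Amazonian rain forests")
doc_terms = [normalize(text) for text in documents]
quorum_scores = [quorum_score(query_terms, terms) for terms in doc_terms]
print("query terms:", query_terms)
print("quorum scores:", quorum_scores)

for term in query_terms:
    n = sum(1 for terms in doc_terms if term in terms)
    idf = inverse_document_frequency(len(documents), n)
    tfs = [term_frequency(terms.count(term), len(terms)) for terms in doc_terms]
    print(term, "idf =", round(idf, 3), "tf per document =", [round(tf, 3) for tf in tfs])
```

The text does not say how the three values are combined into a final score, so the sketch reports them separately rather than guessing at a combination.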