Information Retrieval, Inverse Document Frequency: Coursework Example

Information Searches

Introduction

With the development of information technology, more and more data is being stored in electronic and other forms. Finding the correct data, especially in electronically stored information, is becoming more important by the day. Information retrieval research aims at developing models and algorithms for retrieving information from document repositories. For effective information retrieval it is necessary to understand how search engines work. Information Retrieval (IR) is defined as the science of searching for information in documents, searching for documents themselves, searching for metadata which describes documents, or searching within databases, whether relational stand-alone databases or hypertextually networked databases such as the World Wide Web (Wikipedia).

Process of Retrieval

Information retrieval is the retrieval of unstructured data. It could be retrieval of documents or of specific information within documents. It could also be retrieval of speech or images. When users need some information, they express it as a query, a formal statement of the need, and the information retrieval system finds the relevant information. Most information retrieval is done on text. Query formation is based on a ‘bag of words’, that is, a phrase or group of words treated as independent terms. Because the number of text documents grows constantly, bag-of-words queries lose precision. Synonymous words are one challenge. Another is that a group of words may often have a meaning totally different from that of the individual words. For example, “hot dog” as a group of words has no similarity to “hot” or “dog”.

Ranked retrieval

Ranked retrieval starts with a query and calculates a relevance score between the query and every document. It sorts the documents by their score and presents the top scoring documents to the user.
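The ranked retrieval loop described above can be sketched in a few lines: score every document against the query, sort by score, and return the top results. This is a minimal illustration, not a real search engine. The function names, the sample documents and the placeholder score (the number of distinct query words found in a document) are assumptions made for the example.

```python
def overlap_score(query, text):
    """Placeholder relevance score (an assumption for this sketch):
    the number of distinct query words that also occur in the document."""
    query_terms = set(query.lower().split())
    doc_terms = set(text.lower().split())
    return len(query_terms & doc_terms)


def rank_documents(query, documents, score, top_k=10):
    """Score every document, sort by descending score, return the top_k."""
    scored = [(score(query, text), doc_id) for doc_id, text in documents.items()]
    # Ties are broken by document id so the ordering is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, s) for s, doc_id in scored[:top_k]]


# Hypothetical three-document collection.
documents = {
    "d1": "rain forests of the amazon basin",
    "d2": "hot dog stands in the city",
    "d3": "amazonian rain forests and rivers",
}
ranking = rank_documents("amazonian rain forests", documents, overlap_score, top_k=2)
print(ranking)
```

Any other scoring function with the same two arguments can be passed in place of `overlap_score` without changing the ranking loop.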
Score computing is done in three stages:
• Quorum scoring
• Term frequency (TF) weights
• Inverse Document Frequency (IDF) weights

The query is first normalized. For example, if the original query is “The Amazonian rain forests”, case normalization turns it into “the amazonian rain forests”. Stop word removal then removes function words, so the query becomes “amazonian rain forests”. Suffix removal, also known as stemming, reduces it to “amazon rain forest”.

Quorum scoring favours the documents that contain the largest number of query terms.

Term frequency means that the more often a term is used in a document, the more likely the document is about that term. This also depends on document length. The formula for term frequency is:

$$\mathrm{tf} = \frac{\log(t + 1)}{\log(dl)}$$

where
• t: number of times the term occurs in the document
• dl: length of the document

Inverse document frequency means that a term occurring in many documents is a poor indicator of relevance, while a term occurring in few documents is a good one. This is based on the finding that frequently occurring terms are more general and less frequently occurring terms are more specific. The measure gives a high value to terms that occur in fewer documents. From this we arrive at a formula, the log of the reciprocal:

$$\mathrm{idf} = \log\left(\frac{N}{n}\right)$$

where
• n: number of documents the term occurs in
• N: number of documents in the collection

Putting this together gives the tf•idf weighting:

$$\mathrm{tf} \cdot \mathrm{idf} = \frac{\log(t + 1)}{\log(dl)} \cdot \log\left(\frac{N}{n}\right)$$

This basic combination has proved very successful. The core components of most weighting functions are:
• tf (term frequency)
• idf (inverse document frequency)
• dl (document length)

More refined weighting functions use other formulations of n, N, t and dl. This explains how documents are ranked relative to a query, and it shows how the significance of a term depends on its frequency of occurrence in the document collection and on its frequency of occurrence in a document.

Advanced retrieval

Several advanced techniques are used in retrieval. The first is term weighting. It defines the relative importance of the frequency of occurrence of a term in a document, along with other refinements.

Passage retrieval: documents holding query words close together are better. Passage retrieval splits the document into passages and ranks the document by the score of its highest ranking passage. A passage can be a bounded paragraph, an overlapping window or a fixed window.

Phrases: syntactic or statistical methods are used to locate phrases. The system searches for the phrase in the query and raises the scores of documents that hold the phrase. Statistical methods are better than syntactic methods.

Stemming: stemming is used to improve the search so that trivial word variations do not hinder matches. For example, if the query is “statistics”, it should also match “statistic” and “statistical”. Stemming means removing affixes: suffixes (as in “worker”), prefixes (as in “megavolt”) and infixes (as in “un-bloody-likely”). There are certain rules for affix removers.

It is also important to select ‘good’ terms from relevant documents and to be aware of pseudo-relevance feedback.

Plagiarism

Retrieval methods and theories are important in an IR system. At the same time, there are other dimensions that need to be analyzed. These include style, genre, ease or difficulty of reading and real-world application. Identification and proof of authenticity have always been used in legal and security systems. Genuine authorship has also been an issue in literature and other creative fields. Analysis of the contents helps identify the real writer of a text and whether the text is a copy of some other work. It has a very significant role in legal cases, where it helps in assessing confessions, witness statements, blackmail and the like. Many times the correct identification of a written text and its genuineness has altered the outcome of a case.
Each text has a specific linguistic identity, which is represented by its theme, genre and other style-related features. Some parts of speech, such as nouns, are more closely associated with the subject of the text, while others, such as articles, conjunctions and prepositions (‘and’, ‘the’, ...), are more closely associated with the author. An author’s style is identified by vocabulary, the choice of words from a set of similar meanings, capitalization, the frequency of words like ‘the’, ‘of’ and ‘and’, and the placement of words in the sentence. Sentence length and the use of punctuation are also distinguishing features, which an author uses unintentionally.

Text can be analyzed manually or by computer. Manual analysis is very slow and prone to mistakes. For analysis by computer, the features need to be specified. In the CUSUM, or cumulative sum, approach, two cusum charts are generated for the text to be analyzed, one for sentence lengths and one for the frequency of a specific habit. The charts are superimposed and compared visually. This rests on certain assumptions about the author’s style and is accepted by some courts. It helps to identify whether or not a text was written by more than one person. The method is debated, however, and several objections have been raised: the selection of the habits or patterns, the consistency of the patterns within any one author, the scaling of the axes when drawing the graphs, the subjectivity of visual comparison, and the likely variation in the graphs when extra text is added to the original. Thus the reliability of authors’ linguistic fingerprints remains a debatable issue.

Plagiarism is also called ‘text reuse’. It is the reuse of a pre-existing written text to create a new text, and it can be either legitimate or illegitimate. Text reuse is legitimate when it is part of the creation of a literary work.
It can be reuse by the original author, or it could be the reuse of newswire text by journalists. Borrowing from one’s own and others’ works has a long tradition in creative fields. It is not considered wrong; rather, it is accepted as ‘inspiration’. Plagiarism normally refers to the unethical reuse of text. Rewriting in a plagiarized work is normally done by insertion, deletion or substitution of words. This changes the exact words, or the order of the words or sentences, but the result still remains a copy. The term came into use in the 19th century. The most frequent cases of plagiarism involve students in the academic field. The legal term most often associated with plagiarism is ‘copyright infringement’. Plagiarism is considered a legal as well as a social offence because the plagiarist tries to benefit from the reader’s ignorance by misrepresenting someone else’s work as his own. With the growth of electronic documentation, the cases and probabilities of plagiarism have also grown.

Plagiarism can take different forms:
• Word-for-word plagiarism: phrases or passages from a published text are copied directly without quotation or acknowledgement.
• Paraphrasing plagiarism: words or syntax are changed by rewriting, but the original source is still obvious.
• Plagiarism of secondary sources: a reference or quotation is taken from a secondary source text without looking at the original source.
• Plagiarism of the form of a source: the structure of an argument in a source is copied, either verbatim or rewritten.
• Plagiarism of ideas: an original thought from a source text is reused without using the words or form of the source.
• Plagiarism of authorship: someone puts his name as the author on someone else’s work.

Plagiarism detection works on four different tasks:
1. To find inconsistency within a single text, in order to establish whether or not it was written by the claimed author.
2. To find possible sources of plagiarized text.
3. To find collusion or collaborative writing across more than one text.
4. To identify plagiarism or copying between more than one text.

Plagiarism can be identified manually by detecting a different vocabulary in the author’s work. An improvement or difference in style compared with the author’s earlier works is also a clue. Text that is inconsistent in vocabulary, style or quality, or that is incoherent, can indicate direct copying and insertion from another electronic source. Similar mistakes in spelling, style or content are also a give-away of plagiarized text. Plagiarism also shows up in referencing, when a reference is given in the text but not in the bibliography; any other inconsistency in the bibliography is an indicator as well.

Automatic plagiarism detection is used to assist manual detection. It reduces the time involved and helps to search for possible electronic sources. It also helps in the correct identification of plagiarised and non-plagiarised texts. An automatic comparison of several texts goes through three stages. In the first stage, the input text is pre-processed into a representation suitable for comparison. In the second stage, the text representations are compared, and in the third stage quantitative similarity and visual similarity are measured.

In the pre-processing stage, function words are removed from the original sentence using an IR stopword list. After that, the affixes are removed using the Porter stemming algorithm. Then the parts of speech are tagged using the CLAWS tagger with the C5 tagset, and the sentence is parsed.

The simplest way to compare texts is word or n-gram overlap, as used by CopyCatch and Ferret. More complex methods are used for sequence or alignment comparisons. The comparison yields a similarity score, and a high similarity is an indicator of plagiarism. String matching techniques are used by many online services such as Turnitin.com, Plagiarism.org, MOSS and Digital Integrity.
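The n-gram overlap comparison mentioned above can be illustrated with a minimal sketch. It assumes word trigrams and scores the overlap as the share of the second text's distinct trigrams that also occur in the first text. The whitespace tokenization, the function names and the sample sentences are illustrative assumptions, and the code does not claim to reproduce how CopyCatch or Ferret are implemented.

```python
def word_ngrams(text, n=3):
    """Return the set of distinct word n-grams in a text (whitespace tokens)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(source, derived, n=3):
    """Share of the derived text's distinct n-grams that also occur in the source."""
    source_grams = word_ngrams(source, n)
    derived_grams = word_ngrams(derived, n)
    if not derived_grams:
        # A text shorter than n words has no n-grams to compare.
        return 0.0
    return len(source_grams & derived_grams) / len(derived_grams)


# Hypothetical sample texts: the derived text changes three words.
source = "the quick brown fox jumps over the lazy dog"
derived = "a quick brown fox jumps over a sleeping dog"
score = containment(source, derived)
print(round(score, 3))
```

Because the score is scaled by the size of the derived text only, it is not symmetric: swapping the two arguments can give a different value.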
N-gram overlap represents documents as sets of overlapping n-word or n-character sequences known as n-grams. For example, a 3-gram window considers a textual string of 3 consecutive words and builds trigrams over the entire text, one starting from each word. The overlap of distinct n-grams is measured using ‘containment’. For a source text A and a derived text B, represented by the n-gram sets $S_n(A)$ and $S_n(B)$, the n-gram containment $C_n(A, B)$ is:

$$C_n(A, B) = \frac{|S_n(A) \cap S_n(B)|}{|S_n(B)|}$$

It measures the number of matches between the elements of $S_n(A)$ and $S_n(B)$, scaled by the size of $S_n(B)$. The assumption is that as n becomes larger, it becomes less likely that matching n-grams of this length will occur between non-derived texts. A match is only likely for a very standardized sentence such as a legal disclaimer or a mandatory warning.

There is software to detect plagiarism. Ferret was developed by Dr. Caroline Lyon, visiting research fellow, School of Computer Science, University of Hertfordshire. It is based on trigram containment. Copy find was developed by University of West Virginia. It examines a collection of document files and compares them for matching words in a specified portion. While it can successfully report instances of plagiarism within the given documents, it cannot compare them with any external source. Copy Catch was developed by Woolls and Coulthard of the University of Birmingham. It works on hapax legomena, counting the overlap of shared content words that occur only once. It is unlikely that different authors will share a high percentage of hapax legomena.

Sequence comparison works on the ordering of the strings. The difference between two strings is expressed in edit operations: the method calculates the minimum number of operations, such as insertion, deletion, transposition or substitution, needed to convert one string into the other. It aims to align the strings with the minimum number of operations.
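The minimum number of edit operations described above can be computed with dynamic programming. The sketch below is a minimal version that assumes unit cost for insertion, deletion and substitution and leaves out transposition; this simplified form is commonly known as the Levenshtein distance. The function name and the sample strings are illustrative choices.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions
    (unit cost each, no transposition) needed to turn string a into string b."""
    # previous[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))
```

The same function works on lists of words instead of characters, which is closer to how two sentences would be compared.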
If the alignment is based on the entire string length it is called global alignment, and if it is based on parts of the string it is called local alignment. Levenshtein distance is a simple form of edit distance. It gives the minimum ‘cost’, or number of operations, needed to change one string into the other. In the longest common subsequence approach, the longest sequence of elements that appears in both strings in the same order is taken.

There are many online services, such as Plagiarism.org, Digital Integrity and Copyscape, and many programs, such as Copycatch, Wordcheck and YAP3, that help to detect plagiarism. UK universities use Turnitin, an online service which compares a student’s work with other students’ work and with other electronic sources. It can compare a work with a database of previously submitted works, more than 800 million websites and essays from cheat sites. In its report, it highlights the plagiarized portions and also gives the sources.

Evaluation

We need to evaluate IR systems to know whether they have improved and by how much. The relevance of the search results is one important factor. Normally we get lots of relevant documents in the top 10. To measure the precision of the system, we use the formula:

$$\text{Precision} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Retrieved}|}$$

For any new system, the evaluation process is to go through a set of queries and find the precision for each query as well as the average over all the queries. We also need to check what the system failed to retrieve, that is, its recall. The formula for recall is:

$$\text{Recall} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Total relevant}|}$$

To calculate the total number of relevant documents, pooling is used, and the pool is assessed manually. Pooling can be done across systems or across queries. Site maps, display time and clickthroughs are a few more techniques. The grouping of documents in the Open Directory is also used to locate relevant documents.
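The precision and recall formulas given above can be computed directly from two sets of document identifiers. The following is a minimal sketch; the function name, the document identifiers and the choice to return 0.0 when a set is empty are assumptions made for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers of the documents the system returned
    relevant:  identifiers of all documents judged relevant for the query
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)  # relevant AND retrieved
    # Returning 0.0 for an empty set is a convention chosen for this sketch.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical judgments: 4 documents retrieved, 5 relevant overall, 2 in common.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5", "d6", "d7"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Averaging the per-query values over a set of test queries gives the system-level figures that the evaluation process calls for.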
Links also help in locating relevant websites with high precision. By comparing the test results, relevance can be checked and a rank cut-off can be set. Another way of evaluating is Mean Average Precision (MAP). A test collection can have more than one interpretation of a query, so relevance needs to be defined in a finer way. Relevance can differ according to the purpose of the search, which could be information, navigation or transaction. A few more factors are associated with searches: recency, cost, plagiarism, readability and authoritativeness. Search length, or the time taken for the search, is an important issue. If users know how the search system works, they will be able to retrieve material more effectively.

Speech retrieval

Just like a search over text documents, a search can also be aimed at speech retrieval. Speech is an important medium for information, and it is used in meetings, debates and other formal forums as well as in informal forms such as conversation. Speech-related searches are gaining more and more popularity. Radio and other multimedia channels such as BBC Radio 4, National Public Radio and YouTube are becoming very popular. Speech is an easy medium as it does not require a keyboard. But there is still a lot to be done in this area, as speech archives are underutilized. There are not enough user tools to access and use them, and there is no support for searching, scanning or keyword spotting. One of the techniques used in IR is content-based search of transcripts, which are generated using automatic speech recognition (ASR). Interface approaches use indices such as speaker or visual information for searching. The field is still in a development phase and needs to be perfected. Textual IR works by breaking speech into chunks or audio paragraphs. The speech recogniser can handle only a few minutes of speech at a time, so it combines the segments to restore continuity.
The error rate also depends on factors such as acoustics and outside noise. The recogniser may also face an out-of-vocabulary (OOV) case. Speech recognition depends on factors such as language and vocabulary. It can be discrete or continuous, and speaker-dependent or speaker-independent. Speech recognition has its challenges, but with improvements in information technology and computers it has been able to overcome most of them. Increased processor speeds and the drop in the cost of RAM and disks have also helped. Vocabulary capacity has improved as well: it is now more than 60,000 words, compared with the earlier capacity of fewer than 100. Recognition has become independent of the speaker and can handle continuous speech. The Word Error Rate (WER) has also become very low.

TREC, the Text Retrieval Conference, is an annual IR event that supports and encourages further research in the field of IR. In TREC 6, spoken document retrieval (SDR) was also covered, although it started with a small amount of data. It evaluated the retrieval of speech documents from news broadcasts. In the 2000 evaluations, results were compared with human transcripts of the same documents (Garofolo et al. 2000). Although retrieval depends on the ASR, the results were very good.

The basic difference between processing text and processing speech is speed: listening runs at about 180 words/min and reading at about 350 words/min. While reading is controlled by the reader, speech is controlled not by the listener but by the speaker. There are two possible solutions to this. One is to construct external visual indices into the speech, and the other is to transcribe it using automatic speech recognition. Speech retrieval then works on text, which is either a recognized transcript or a human transcription. Both are equally good. Multiple transcripts can also be combined. In handling multiple transcripts, SDR uses the same technique as term frequency. TREC 8 has taken up some new areas.
Its focus is on story segmentation and the removal of commercials, as well as on vocabulary limitations and the retrieval of casual speech such as general or telephone conversation. It also has to overcome the challenge of not recognizing foreign languages.

Cross Language Information Retrieval

Cross-language retrieval is also called trans-lingual retrieval. It covers the case where the source, or query, is in one language and the documents to be retrieved, also referred to as the target, are in other languages. In multilingual IR (MLIR), the collection has documents in many languages and the query can also be in many languages, but no translation is required. In monolingual IR, the query and the documents are all in the same language. Cross-language retrieval was earlier called multilingual IR; the name ‘cross-language’ came up at a SIGIR workshop in 1996. The scenario has arisen because the use and knowledge of other languages have increased. It is expected that very soon 70% of Internet users will be using languages other than English. There is an increasing trend of retrieving documents for the purpose of translation. Another fact is that people learn to read a language before they learn to write it, so there is a growing number of people who want to read documents in other languages.

Although the need is there, cross-language retrieval has certain problems, as the natures of languages differ. For example, Chinese has no space character, Japanese has four sets of alphabets, languages like German and Finnish have long compound words, and languages like Arabic and Latvian have a very large number of characters. Apart from the structure of the language, one word can have more than one meaning. For example, ‘grand’ in French can be translated as big, large, huge or massive. Similarly, a phrase can have a meaning different from the meanings of its individual words. For example, ‘petit déjeuner’ cannot be translated word by word as ‘little lunch’, and the correct translation, ‘breakfast’, does not carry the word-for-word meaning.
The translation resource can be a machine translation (MT) system or a bilingual dictionary. Machine translation is expensive, but it copes well with ambiguity, as it has hand-built rules and takes the first guess only. It is designed for a more complex purpose. Machine translation can be seen at work in sites like Eurovision, which can provide the equivalent of a word in the source language as a word in the target language. It works very well for some queries but not as well for others. A bilingual dictionary is very simple but has no built-in support for ambiguity. Users who receive a machine-translated document can work with it even if they cannot read the original language. It has been seen that users can in general judge the relevance of a foreign-language document that is retrieved and then translated (Information retrieval, 2004).

When the two languages are similar to each other, like French and English, the queries can be expanded in both languages. This may pose problems with proper names that are spelt differently in the two languages. They may not be listed in the dictionary, and they need to be treated as spelling correction problems (Fuzzy Translations, 2003).

Translating the query and translating the document pose different sets of challenges. A query takes less work, but it offers much less evidence on which to base the translation. A document may require more work, but it has a larger base of evidence and a better chance of being translated correctly. Sometimes translation from one language to another is simpler than the other way round.

There are two more translation resources, namely aligned bilingual (parallel) corpora and comparable corpora. An aligned bilingual, or parallel, corpus is aligned at the sentence level and is a direct translation of one text into another. To build one, parallel texts need to be found on the web by crawling various sites.
A comparable corpus is not a direct translation like a parallel corpus. It is aligned at the document level and is designed to handle phrases and ambiguity. Both kinds are still rare, and more work is needed in this area. There are still language pairs for which no translation resource is available, such as Portuguese to Greek. In such cases an intermediate language can be used: Portuguese can be translated into English and then English into Greek, as both of these facilities are available. More than one pivot language can also be used, and the results can be compared (Gollins, 2001). A suggestion by Ballesteros is to expand the query from a separate collection in the query language itself before submitting it for translation. Another strategy is to expand the query after translation. Both show improvement, and the approach has given very high retrieval accuracy.

Image retrieval

Image retrieval is another emerging type of search. An image search can be specific, generic or abstract. Since images cannot be fully expressed in words, their interpretation requires vast knowledge. A broad type of image may require only low-level descriptors, but some searches require higher-level descriptors. It is easier to define the descriptors when working in a closed set, such as sorting paintings by their style or colours, or dating photographs by their colours, so image search is finding applications in this area. It is also used for checking copyright infringement by searching for and comparing against known images from a collection. It is similarly helpful in trademark searches. It works much better for abstract images, as they require less human interpretation.

Videos are moving images. The production of videos, films and TV programmes is increasing at a very high rate. People normally search video by scene or by story. Scene search is used in medicine as well as in sport to study a particular scene.
Searching video for content is similar to searching for images. Metadata is available for digital images. Digital cameras have automatic scene segmentation. DVDs also have subtitles and audio descriptions that help in searches.

In the field of 3D images it is not very clear what people want. Currently 3D search is used in VRML, the Virtual Reality Modeling Language, but it is not very practical for images. In VRML the search works on a collection of clearly defined objects, but the same is not possible in the case of images. It can be used for searching archives of set designs, but there are not enough images to search from. It may find use in creating 3D models from photographs, mainly of buildings, or in giving different views of a building.

AI and NLP

Understanding how an IR system works helps in common research and commercial applications. We also discover that all the techniques depend on each other. Simple weighting in text searches is based on term frequency (tf), inverse document frequency (idf) and document length (dl). While this works well with stop words, it is less reliable with other terms. There is a variation called Inverse Collection Term Frequency (ictf). The formula for this is:

$$\mathrm{ictf} = \log\left(\frac{N + 1}{F + 0.5}\right)$$

Here F is the total number of occurrences of a term in the collection. Another variation is to tag query words with their grammatical type.

To make retrieval more accurate, the process needs to be modelled mathematically. This requires an understanding of the vector space, classic probabilistic, BM25 and language models. In the vector space model, documents are represented as vectors of terms. Each document is a vector in N-dimensional space, where N is the number of unique terms in the collection. Based on document similarity theory, we can obtain the relevance ranking of a document for a query. This is done by comparing the angles between each document vector and the query vector, where the query is represented as the same kind of vector as the documents.
It is more practical to calculate the cosine of the angle between the vectors than the angle itself. A cosine value of zero implies that the query vector and the document vector had no match, that is, no query term was present in the document. In BM25, term frequency is used very effectively.

Language models are used for speech recognition and machine translation. They work by building unigram and multi-gram models of language. For retrieval, each document is viewed as a language model. We can calculate the probability of the query being generated from the document and compute it for all documents to get the ranking.

Although most models consider words to be independent of each other, term combinations are relevant as well. In early attempts to model this dependence, the probabilistic model was unsuccessful, whereas language models were more successful. Ad hoc approximations of dependence rely on proximity: in passage retrieval, documents holding the query terms in close proximity are treated as more relevant. In phrase indexing, documents holding the query phrase are treated as more relevant. The same applies in pseudo-relevance feedback, or automatic query expansion, and in spelling correction. All methods of passage retrieval split the document into passages and rank it by the score of its highest ranking passage.

Stemming is used to make the query more effective by removing affixes (suffixes, prefixes, infixes). N-grams, rule-based affix removers and statistical methods have been used in IR. Pseudo-relevance feedback was introduced by Croft and Harper. It assumes that the top ranking documents are relevant and marks them as relevant automatically. Some of these documents may be judged non-relevant by others. Smucker and Allan introduced another search tool called Find-similar. They evaluated the system using simulated users, which is a new form of relevance evaluation.
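The vector space ranking discussed in this section can be combined with the essay's tf•idf weighting in a minimal sketch. Document weights use log(t + 1) / log(dl) multiplied by log(N / n), and documents are ranked by the cosine between the query vector and each document vector. Several choices here are assumptions made for the example rather than part of the essay: query terms are weighted by idf alone, documents shorter than two words get a weight of zero to avoid dividing by log(1) = 0, tokens are split on whitespace, and all names and sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_weight(t, dl):
    """Term frequency weight log(t + 1) / log(dl); 0.0 when dl < 2 (assumption)."""
    if dl < 2:
        return 0.0
    return math.log(t + 1) / math.log(dl)


def idf_weight(N, n):
    """Inverse document frequency log(N / n)."""
    return math.log(N / n)


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_cosine(query, documents):
    """Rank documents by the cosine between tf-idf document vectors and the query."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    N = len(tokenized)
    # n for each term: the number of documents the term occurs in.
    doc_freq = Counter(term for words in tokenized.values() for term in set(words))
    # Query vector: idf weight for each query term known to the collection.
    query_vec = {term: idf_weight(N, doc_freq[term])
                 for term in set(query.lower().split()) if term in doc_freq}
    results = []
    for doc_id, words in tokenized.items():
        counts = Counter(words)
        doc_vec = {term: tf_weight(t, len(words)) * idf_weight(N, doc_freq[term])
                   for term, t in counts.items()}
        results.append((doc_id, cosine(query_vec, doc_vec)))
    results.sort(key=lambda pair: (-pair[1], pair[0]))
    return results


# Hypothetical collection of three short documents.
documents = {
    "d1": "amazon rain forest rain",
    "d2": "hot dog stand",
    "d3": "forest fire season",
}
results = rank_by_cosine("rain forest", documents)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

In this collection the document that shares no term with the query receives a cosine of exactly zero, which matches the interpretation of a zero cosine given above.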
A lot of research is still going on in this direction, and with the advancement of technology new solutions are also coming up.

Conclusion

IR is gaining in importance every day. The amount of data and information being created is increasing in all media, whether text, document, image, speech or video. At the same time, users’ needs are growing by the day. Users are becoming more aware of the techniques, and their requirements are also increasing. IR researchers are working to improve search results, reduce error probabilities, make systems more user friendly and find the shortcomings of existing systems. As new needs come up, IR also has to keep pace and adapt itself to all kinds of usage involving various languages and various media.

Reference:
1. Allan, James (2001) “Perspectives on Information Retrieval and Speech,” in Information Retrieval Techniques for Speech Applications, Coden, Brown and Srinivasan, editors, pp. 1-10. http://ciir.cs.umass.edu/pubfiles/ir-236.pdf
2. Ballesteros, L., Sanderson, M. (2003) Addressing the lack of direct translation resources for cross language retrieval, in the Proceedings of the 12th international conference on Information and Knowledge Management (CIKM), 147-152
3. Gollins, T., Sanderson, M. (2001) Cross Language Retrieval with Triangulated Translation. In the Proceedings of the 24th ACM SIGIR conference, 90-95
4. Enser, P. & Sandom, C. (2002) Retrieval of archival moving imagery: CBIR outside the frame? Proceedings of the International Conference on Image and Video Retrieval (London, July 18-19, 2002). Berlin: Springer (Lecture Notes in Computer Science 2383), 206-214
5. Harman, D. (1992) Ranking algorithms, in Frakes, W. & Baeza-Yates, R. (eds.), Information Retrieval: Data Structures & Algorithms, 363-392
6. Krovetz, R. (1993) Viewing morphology as an inference process, in Proceedings of the 16th ACM SIGIR conference, 191-202
7. Hammarström, H. (2006) Poor Man's Stemming: Unsupervised Recognition of Same-Stem Words. Proceedings of the Third Asia Information Retrieval Symposium, AIRS 2006, pp. 323-333
8. Collins-Thompson, K. and Callan, J. (2004) A language modeling approach to predicting reading difficulty. In Proceedings of the HLT/NAACL 2004 Conference. Boston.
9. Lyon, C., Malcolm, J. and Dickerson, B. (2001) Detecting Short Passages of Similar Text in Large Document Collections. In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, 118-125
10. Martin, B. (1994) Plagiarism: a misplaced emphasis, Journal of Information Ethics, Vol. 3(2), 36-47
11. Melucci, M., Orio, N. (2003) A novel method for stemmer generation based on hidden Markov models. In: CIKM’03: Proceedings of the twelfth international conference on Information and knowledge management, 131-138
12. Oard, D., Gonzalo, J., Sanderson, M., López-Ostenero, F., Wang, J. (2004) Information Retrieval, Vol. 7, Issue 1-2, 205-221
13. Park, L.A.F., Ramamohanarao, K., Palaniswami, M. (2005) A Novel Document Retrieval Method Using the Discrete Wavelet Transform, in ACM Transactions on Information Systems, 23(3)
14. Pirkola, A., Toivonen, J., Keskustalo, H., Visala, K. & Järvelin, K. (2003) Fuzzy Translation of Cross-Lingual Spelling Variants. In Proceedings of the 26th ACM SIGIR Conference, pp. 345-352
15. Porter, M.F. (1980) An algorithm for suffix stripping, in Program: automated library and information systems, 14(3): 130-137
16. Samuelson, P. (1994) Self-Plagiarism or Fair Use?, Communications of the ACM, Vol. 37(8), 21-25
17. Singhal, A. (1996) Pivoted document length normalization, Proceedings of the 19th ACM SIGIR conference, 21-29
18. Garofolo et al. (April 2000) The TREC Spoken Document Retrieval Track: A Success Story. http://www.nist.gov/speech/tests/sdr/sdr2000/papers/01plenary1.pdf
19. Xu, J., Croft, W.B. (1998) Corpus-Based Stemming using Co-occurrence of Word Variants, in ACM Transactions on Information Systems, 16(1): 61-81
20. Wikipedia, http://en.wikipedia.org/ Information Retrieval/ Retrieved on May 15, 2007
