Essays on Information Retrieval, Inverse Document Frequency Coursework


The paper "Information Retrieval, Inverse Document Frequency" is an outstanding example of management coursework. With the development of information technology, more and more data is being stored in electronic and other forms. Finding the correct data, especially in electronically stored information, is becoming more important by the day. Information retrieval research aims at developing models and algorithms for retrieving information from document repositories. For effective information retrieval, it is necessary to understand how search engines work. Information Retrieval (IR) is defined as the science of searching for information in documents, searching for documents themselves, searching for metadata which describes documents, or searching within databases, whether relational stand-alone databases or hypertextually networked databases such as the ‘World Wide Web’ (Wikipedia).

Process of Retrieval

Information Retrieval is the retrieval of unstructured data. It could be the retrieval of documents or of specific information within documents. It could also be the retrieval of speech or images. When a user needs some information, they convert that need into a query, a formal statement of the need, and the Information Retrieval system finds the relevant information. Most information retrieval is done on text.

Query formation is based on a ‘bag of words’: the query is treated as a group of words, without regard to how they combine. As text collections keep growing, bag-of-words queries lose precision in their results. Synonymous words are one challenge. Another is that a group of words often means something quite different from its individual words. For example, “hot dog” as a phrase has no similarity in meaning to “hot” or “dog”.

Ranked retrieval

Ranked retrieval starts with a query and calculates a relevance score between the query and every document. It sorts the documents by score and presents the top-scoring documents to the user. Score computing is done in three stages:

• Quorum scoring
• Term frequency (TF) weights
• Inverse Document Frequency (IDF) weights

For example, if the original query is “The Amazonian rain forests”, case normalization turns it into “the amazonian rain forests”.
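The scoring stages named above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular search engine. It rests on several assumptions that the text does not spell out: quorum scoring is taken to mean the number of distinct query terms found in a document, TF is a raw term count, and IDF is computed as log(N / df). The function names and the three sample documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Case-normalize and split on whitespace (a deliberately simple bag of words)."""
    return text.lower().split()


def quorum_score(query_terms, doc_terms):
    """Assumed definition: count of distinct query terms present in the document."""
    return len(set(query_terms) & set(doc_terms))


def tf_score(query_terms, doc_terms):
    """Sum of raw term frequencies of the query terms in the document."""
    counts = Counter(doc_terms)
    return sum(counts[t] for t in set(query_terms))


def idf(term, all_doc_terms):
    """Assumed variant: log(N / df); 0.0 if the term occurs in no document."""
    df = sum(1 for terms in all_doc_terms if term in terms)
    return math.log(len(all_doc_terms) / df) if df else 0.0


def tf_idf_score(query_terms, doc_terms, all_doc_terms):
    """TF weight multiplied by IDF weight, summed over the query terms."""
    counts = Counter(doc_terms)
    return sum(counts[t] * idf(t, all_doc_terms) for t in set(query_terms))


def rank(query, docs):
    """Return (document index, score) pairs, highest score first."""
    q = tokenize(query)
    all_doc_terms = [tokenize(d) for d in docs]
    scored = [(i, tf_idf_score(q, terms, all_doc_terms))
              for i, terms in enumerate(all_doc_terms)]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical three-document collection, for illustration only.
DOCS = [
    "the amazonian rain forests are shrinking",
    "rain in the city",
    "the city council met today",
]

if __name__ == "__main__":
    for index, score in rank("The Amazonian rain forests", DOCS):
        print(index, round(score, 3), DOCS[index])
```

In this sample collection the word "the" occurs in every document, so its IDF is zero and it contributes nothing to the ranking, while rarer terms such as "amazonian" carry most of the weight. That is the motivation for adding IDF weights on top of plain term counts.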

References

1. Allan, J. (2001) Perspectives on Information Retrieval and Speech, in Information Retrieval Techniques for Speech Applications, Coden, Brown and Srinivasan, editors, pp. 1-10. http://ciir.cs.umass.edu/pubfiles/ir-236.pdf

2. Ballesteros, L., Sanderson, M. (2003) Addressing the lack of direct translation resources for cross language retrieval, in Proceedings of the 12th International Conference on Information and Knowledge Management (CIKM), 147-152

3. Cross Language Retrieval with Triangulated Translation. In the Proceedings of the 24th ACM SIGIR conference, 90-95

4. Enser, P., Sandom, C. (2002) Retrieval of archival moving imagery: CBIR outside the frame? Proceedings of the International Conference on Image and Video Retrieval (London, July 18-19, 2002). Berlin: Springer (Lecture Notes in Computer Science 2383), 206-214

5. Harman, D. (1992) Ranking algorithms, in Frakes, W. & Baeza-Yates, R. (eds.), Information Retrieval: Data Structures & Algorithms, 363-392

6. Krovetz, R. (1993): Viewing morphology as an inference process, in Proceedings of the 16th ACM SIGIR conference: 191-202

7. Hammarström, H. (2006) Poor Man's Stemming: Unsupervised Recognition of Same-Stem Words. Proceedings of the Third Asia Information Retrieval Symposium, AIRS 2006, pp. 323-333

8. Collins-Thompson, K., Callan, J. (2004) A language modeling approach to predicting reading difficulty. In Proceedings of the HLT/NAACL 2004 Conference. Boston.

9. Lyon, C., Malcolm, J. and Dickerson, B. (2001), Detecting Short Passages of Similar Text in Large Document Collections, In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, 118-125.

10. Martin, B. (1994), Plagiarism: a misplaced emphasis, Journal of Information Ethics, Vol. 3(2), 36-47.

11. Melucci, M., Orio, N. (2003) A novel method for stemmer generation based on hidden markov models. In: CIKM’03: Proceedings of the twelfth international conference on Information and knowledge management, 131-138

12. Oard, D., Gonzalo, J., Sanderson, M., López-Ostenero, F., Wang, J., (2004) Information Retrieval, Vol. 7, Issue 1-2, Pages 205-221

13. Park, L.A.F., Ramamohanarao, K., Palaniswami, M. (2005) A Novel Document Retrieval Method Using the Discrete Wavelet Transform, in ACM Transactions on Information Systems, 23(3)

14. Pirkola, A., Toivonen, J., Keskustalo, H., Visala, K., Järvelin, K. (2003) Fuzzy Translation of Cross-Lingual Spelling Variants. In Proceedings of the 26th ACM SIGIR Conference, pp. 345-352

15. Porter, M.F. (1980) An algorithm for suffix stripping, in Program - automated library and information systems, 14(3): 130-137

16. Samuelson, P. (1994), Self-Plagiarism or Fair Use?, Communications of the ACM, Vol. 37(8), 21-25.

17. Singhal, A. (1996): Pivoted document length normalization, Proceedings of the 19th ACM SIGIR conference: 21-29

18. Garofolo et al. (April 2000) The TREC Spoken Document Retrieval Track: A Success Story. http://www.nist.gov/speech/tests/sdr/sdr2000/papers/01plenary1.pdf

19. Xu, J., Croft, W.B. (1998) Corpus-Based Stemming using Co-occurrence of Word Variants, in ACM Transactions on Information Systems, 16(1): 61-81

20. Wikipedia, Information Retrieval. http://en.wikipedia.org/ Retrieved on May 15, 2007
