Strategic enhancement of the collaborative framework for novelty in retrieval from digital textual data corpus by deploying DPSC and RBWM algorithms for forensic analysis

  • Gowri Shanmugam Sathyabama University
  • Shanmugam Mala Ganapathy Sankar
Keywords: Data management, document clustering, Google’s Crawler, preprocessing, semantic.

Abstract

This paper proposes two algorithms embedded in an integrated system: the Dynamic Path Selection Clustering (DPSC) algorithm for document clustering and the Rearward Binary Window Match (RBWM) algorithm for the user's search engine. The DPSC algorithm is derived from the concept of Google's crawler technique, implemented as offline processing, and the RBWM algorithm is derived from techniques used in other search algorithms. The proposed system is intended to give an appropriate data structure to the content of the input dataset. The input is the Enron dataset, which is large in volume and unstructured. The system integrates four individual, independent units under one framework: data preprocessing, document clustering, mapping of clusters, and a search engine. Many existing systems in forensic departments rely on a simply defined search engine with no supporting processes, and as a result they retrieve irrelevant information to a large extent. The refined, integrated framework is expected to support evidencing better, because it adds the processing steps that a bare search engine lacks. Consequently, an integrated system that is automated through these well-defined, configured units is proposed. This systematic approach supports adequate use of digital textual evidence, which assists quicker crime identification. The outcomes of the proposed system are analyzed by computing precision and recall values and comparing them with the results of metasearch engines such as Dogpile and Metacrawler, to test retrieval efficacy.
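The precision and recall measures mentioned above can be illustrated with a minimal sketch. It uses the standard set-based definitions for a single query; the function name and the document identifiers are hypothetical and are not taken from the paper, which does not describe its evaluation code.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Guard against empty sets so the function never divides by zero.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical document IDs, for illustration only.
retrieved = ["mail_01", "mail_02", "mail_03", "mail_04"]
relevant = ["mail_02", "mail_04", "mail_07"]

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

Comparing two retrieval systems in this way would mean computing both values for the same queries against the same relevance judgments.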

 

Published
2015-12-03
Section
Computer Engineering