Big Data Indexing: Taxonomy, Performance Evaluation, Challenges and Research Opportunities

Abubakar Usman Othman(1), Timothy Moses(2), Umar Yahaya Aisha(3), Abdulsalam Ya’u Gital(4), Boukari Souley(5), Badmos Tajudeen Adeleke(6),


(1) Federal University Lafia
(2) Abubakar Tafawa Balewa University
(3) Gombe State University
(4) Abubakar Tafawa Balewa University
(5) Abubakar Tafawa Balewa University
(6) Industry and Innovation Institute, Sheffield Hallam University

Abstract


In order to efficiently retrieve information from highly huge and complicated datasets with dispersed storage in cloud computing, indexing methods are continually used on big data. Big data has grown quickly due to the accessibility of internet connection, mobile devices like smartphones and tablets, body-sensor devices, and cloud applications. Big data indexing has a variety of problems as a result of the expansion of big data, which is seen in the healthcare industry, manufacturing, sciences, commerce, social networks, and agriculture. Due to their high storage and processing requirements, current indexing approaches fall short of meeting the needs of large data in cloud computing. To fulfil the indexing requirements for large data, an effective index strategy is necessary. This paper presents the state-of-the-art indexing techniques for big data currently being proposed, identifies the problems these techniques and big data are currently facing, and outlines some future directions for research on big data indexing in cloud computing. It also compares the performance taxonomy of these techniques based on mean average precision and precision-recall rate.


Keywords


Indexing; Similarity search; Matching; Big data; Cloud Computing

Full Text:

PDF

References


Thilkanathan, Danan, S. C., Surya, N., Rafael, C., & Leila, A. (2014). A platform for monitoring and sharing of generic health data in the cloud. Future generation computer system, 35, 102-113. Retrieved April 9, 2017. https://doi.org/10.1016/j.future.2013.09.011

Huang, Z., Heng, T. S., & Shao, J. (2010). Bounded Coordinate System Indexing for Real-time. ACM Transactions on Information Systems, 10(10), 1-32. https://doi.org/10.1145/1508850.1508855

Gartner M, Rauber, A., & Berger, H. (2013). Briging structured and unstructured data via hybrid semantic search and interactive ontology-enhanced query formulation. Knowledge information system, 1-32. https://doi.org/10.1007/s10115-013-0678-y

Armbrust, M., Fox , A., Griffith, R., Joseph, A. D., Katz, H. R., & Ko, A. (2009, 2). Above the clouds: Berkeley view of cloud computing, Technical Report UCB/EECS.Materials Genome initiative for Global Competitiveness. https://inst.eecs.berkeley.edu/~cs10/fa10/lec/20/2010-11-10-CS10-L20-AF-Cloud-Computing.pdf

Agrawal, D., Bernstein, P., Bertino, E., Davidson, S.,& Dayal, U. (2012). Challenges and Opportunities with Big Data. A white paper prepared for the Computing Community Consortium, 1-16. https://cra.org/ccc/wp-content/uploads/sites/2/2015/05/bigdatawhitepaper.pdf

Chang, c., Kayed, M., Girgis, M. R., & Shaalam, K. F. (2006). A survey of web information extraction system. IEEE Transaction on Knowledge and Data Engineering, 18(10), 1411-1428. https://doi.org/10.1109/TKDE.2006.152

Agrawa, C. C., & Wang, H. (2010). Managing and mining graph data. Springer publishing company. https://doi.org/10.1007/978-1-4419-6045-0

Hosagrahar, V. J., Beng, C., Kian-Lee, T., Cui, Y., & Rui, Z. (2005). iDistance: An adaptive B+-tree based indexing method for nearest neighbour search. ACM Transaction on Database Systems, 30(2), 364-397. https://doi.org/10.1145/1071610.1071612

Muja, M., & Lowe, D. G. (2009). Fast approximate nearest neighbours with automatic algorithm configuration. VISAPP, 331-340.

Jon, L. B. (1975). Multidimensional Binary Search Trees Used for Associative Searching. ACM, 18(9), 509-517. https://doi.org/10.1145/361002.361007

Chanop, S.-A., & Richard, H. (2008). Optimised KD-trees for fast image descriptor matching. IEEE Conference on Computer Vision and Pattern Recognition, 1-8. https://doi.org/10.1109/CVPR.2008.4587638

Shamshirband, S., Anuar, N., Kiah, M., & Patel, A. (2013). An appraisal and design of a multi-agent system based cooperative wireless intrusion detection computational intelligence technique. Engineering Application of Artificial Intelligent. , 26(8), 2105-2127. https://doi.org/10.1016/j.engappai.2013.04.010

Aguilera M, K., Golab, W., & Shah, M. A. (2008). A practical scalable distributed b-tree. Proceedings of the VLDB Endowment, 1(1), 598–609. https://doi.org/10.14778/1453856.1453922

Jaluta, I. (2014, April). Transaction management in b-tree-indexed database systems. In Information Science, Electronics and Electrical Engineering. International conference on, 3, 1968-1975. https://doi.org/10.1109/InfoSEEE.2014.6946267

Frome, A., Singer, Y., Sha, F., & Malik, J. (2008, June). Learning globally-consistent local. ICCV. https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/shape/papers/FromeSingerShaMalikICCV07.pdf

Friedman, J., Bentley, J. L., & Finkel, R. A. (1977). An algorithm for finding best matches in logarithmic expected time. ACM Transaction on Mathematical Software, 3(3), 209-226. https://doi.org/10.1145/355744.355745

Kai, Z. Y., Nicholas, J. Y., & Shuo, S. (2013). Discovering of gathering patterns from trajectories. ICDE. https://doi.org/10.1109/ICDE.2013.6544829

Hoyoung, J., Man, L. Y., Xiaofang, Z., Christian, S. J.,& Heng, T. S. (2008). Discovery of convoys in trajectory databases. VLDBJ. https://doi.org/10.14778/1453856.1453971

Yu, Y., Zhu, Y., Ng, W., & Samsudin, J. (2014, 12). An efficient multidimension metadata index and search system for cloud data,” in Cloud Computing technology and science. IEEE Transaction on, 499-504. https://doi.org/10.1109/CloudCom.2014.88

Dieter, P., Christian, S. J., & Yanmis, T. (2000). Novel approaches to the indexing of moving objects trajectories. VDLB. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.23.3875&rep=rep1&type=pdf

Lei, C., Tammer, M. O., & Vincent, O. (2005). Robust and fast similarity search for moving object trajectories. ICDE. https://doi.org/10.1145/1066157.1066213

Michail, V., George, K., & Dimitrios. (2002). Discovering similar multidimensional trajectories. ICDE. https://doi.org/10.1109/ICDE.2002.994784

Prateek Jain, B. K. (2008, 6). Fast image search for learned metrics. In proceeding of the IEEE conference on computer vision and pattern recognition. https://doi.org/10.1109/CVPR.2008.4587841

Xu, H., Wang, J., Li, Z., Zeng, G., Li, S., & Yu, N. (2011). Complementary hashing for approximate nearest neighbor search. In Proc. ICCV. https://doi.org/10.1109/ICCV.2011.6126424

Kulis, B., Jain, P., & Grauman, K. (2009). Fast similarity search for learned metrics. TPAMI, 31(12), 2143–2157. https://doi.org/10.1109/TPAMI.2009.151

Torralba, A., Fergus, R., & Freeman, W. T. (2008). 80 million tiny images: a large dataset for non-parametric object and scene recognition. TPAMI, 30(11), 1958–1970,. https://doi.org/10.1109/TPAMI.2008.128

Strecha, C, A. M., M, M. B., & P, F. (2012). Ldahash: Improved matching with smaller descriptors. TPAMI, 34(1), 66-76. https://doi.org/10.1109/TPAMI.2011.103

Zhu, X., Huang, Z., Cheng, H., Cui, J., & Shen, H. T. (2013). Sparse hashing for fast multimedia search. ACM Transaction on information system, 3(2), 1-24. https://doi.org/10.1145/2457465.2457469

Avidan, S., & Korman, S. (2011). Coherency sensitive hashing. In Proceedings of ICCV. https://doi.org/10.1109/ICCV.2011.6126421

Datar, M., Immorlica, N., Indyk, P., & Mirrokni, P. (2004). Locality sensitive hashing scheme based on p-stable distributions. In Proceedings of the Symposium on Computational Geometry, 253–262. https://doi.org/10.1145/997817.997857

Zhou, A. (2005). c^2: a new overlay network based on can and chord. international journal of high performance computing network, 3(4), 248-261. https://doi.org/10.1007/978-3-540-24679-4_15

W. Liu, J. Wang, R. J. Y-G. Jang, S-F Chang., (2012). Supervised hashing with kernels. In computer vision and pattern recognition. https://doi.org/10.1109/CVPR.2012.6247912

A. Jolly & O. Buisson. (2011). Random maximum margin hashing. CVPR. https://doi.org/10.1109/CVPR.2011.5995709

H. Jae-Pil, L. Youngwoon, H. Junfeng, C. Shih-Fu, Y. Sung_Eui, (2015). Spherical Hashing: Binary Code Embedding with Hyperspheres. IEEE transaction on Pattern Analysis and Machine Intelligent, 1-14. https://doi.org/10.1109/TPAMI.2015.2408363

J. He, R. Rhadhakrishnan, S-F Chang and C. Bauer. (2011). Compact hashing with joint optimisation of search accuracy and time. CVPR. https://doi.org/10.1109/CVPR.2011.5995518

Y. Gong and S. Lazebnik. (2011). Itetrative Quantisation: a procrustean approach to learning binary codes for large-scale image retrieval. IEEE transaction on pattern analysis and machine inteliigence. https://doi.org/10.1109/TPAMI.2012.193

A. Torralba, R. fergus, and Y. Weiss, (2008). Small codes and large image dtatabases for recognition. CVPR. https://doi.org/10.1109/CVPR.2008.4587633

Y. Weiss, A. Torralba, and R. fergus, . (2008). Spectral Hashing. in proceedings of NIPS.

O. Chum, J. Philbin, A. Zisseman, (2008). Near duplicate image detectionmin-hash and tf-idf weighting. BMVC.

M. Rangisky and S. Lazebnik. (2009). Locality sensitive binary codes from shift-invariant kernels. in proceedings od NIPS, 1509-1517.

R. Salakhutdinov, G. Hinton, (2009). Semantic Hashing. International Journal of Approximate reasoning. https://doi.org/10.1016/j.ijar.2008.11.006

Boukari Souley, A. U. Othman (2019). Geometric Similarity Preserving Embedding-Based Hashing for Bid Data in Clou Computing. International Journal of research and Scientific Innovation.

J. Wang, S. Kumar, S-F Chang,. (2010). Sequential projection learning for hashing with compact codes. ICML.

L. Pauleve, H. Jegou, L. Amsaleg, (2010). Locality sensitive Hashing: A comparison of hash function types and queryong mechanism. Pattern recognition Letters. https://doi.org/10.1016/j.patrec.2010.04.004

J. Wang, S. Kumar, and S-F Chang, (2010). Semi-supervised hashing for scalable image retrieval. CVPR. https://doi.org/10.1109/CVPR.2010.5539994

R-S Lin, D. Rose, J. Yangik, (2010). Spec Hashing: Similarity preserving algorithm for entropy-base coding. CVPR. https://doi.org/10.1109/CVPR.2010.5540129

R. Ye, Z. Li, (2016). Compact structure hashing via sparse and similarity preserving embedding. IEEE transaction on cybernatics, 46(3), 718-729. https://doi.org/10.1109/TCYB.2015.2414299

H. Zhang, L. Liu, Y. Yong, L. Shao, (2017). Unsupervised deep hashing with pseudo labels for scalable image retrieeval. https://doi.org/10.1109/TIP. 2017. 2781422

Y. Lv, W. Y. Ng Wing, Z. Zeng, S. D. Yeung,and P. K. Patrick (2015). Asymmetric Cyslical Hashing for Large Scale Image Retrieval. IEEE transaction on multimedia, 11(8), 1225-1235. https://doi.org/10.1109/TMM.2015.2437712

M. Norouzi and D. J. Fleet. (2011). Minimal Hashing for Compact binary codes. ICML.

Kadiyala S, S. N. (2008). A compact multi-resolution indedx for variable length queries in time series database. Knowledge information system, 15(2), 131-147. https://doi.org/10.1007/s10115-007-0097-z

Meshram, B. B., & Gaikwad, G. P. (2013, 4). Different indexing techniques. International Journal of Engineering Research and Application, 3(2), 1230-1235.

Chen, J., Yuegue, C., Lia, E., Cuiping, I. L., & Jiaheng, U. L. (2013). Big Data Challenges: A data Management Perspective. Higher education press and springer verlag Berlin Heidelberg, 7(2), 157-164. http://dx.doi.org/10.1007/s11704-013-3903-7

Kaisler, S.,Armour, F., Espinosa , J. A., & Money, W. (2013). Big data: issues and challenges moving forward. Hawii international conference on system sciences, 995-1004. http://dx.doi.org/10.1109/HICSS.2013.645

M. S. Charkar, (2002). Similarity estimation techniques from rounding algorithms. In Proceedings of Annual ACM Symposium on Theory of Computation, 380-388. https://doi.org/10.1145/509907.509965

B. Kulis, K. Grauman, (2009, September-October). Kernelised Locality-sensitive hashing for scalable image search. In Proceedings of IEEE conference on computer vision and pattern recognition, 2130-2137. http://dx.doi.org/10.1109/ICCV.2009.5459466

B. Souley, A. U. Othman, A. Y. Gital and I. M. Adamu, (2019). Performance evaluation of GSPEBH for big data in cloud computing. Global Scientific Journal.

C. Yan, H. Xie, D. Yang, J. Yin, Y. Zhang, (2018). Supervised hash coding with deep neural network for environment perception of intelligent vehicles. IEEE transaction on intelligent transportation systems, 19(1), 284-295. https://doi.org/10.1109/TITS.2017.2749965

M. He, Y. Yang, F. Shen, N. Xie, and H. T. Shen, (2017). Hashing with Angular Reconstruction Embeddings. IEEE Transactions on Image Processing. 27(5), 545-555. https://doi.org/10.1109/TIP.2017.2749147

Nussinov, R., & Wolfson, H. J. (1991, 12 01). Efficient Detection of three-Dimensional Structural Motifs in Biological Macromolecules by Computer Vision techniques. Peoceedings of the National Academy of Science America, 88(23), 10495-10499. https://doi.org/10.1073%2Fpnas.88.23.10495

Mehrotra, H., Majhi, B., & Gupta, P. (2010). Robust ris indexing scheme using geometric hashing of SIFT keypoints. Journal of Network and Computer Applications, 33, 300–313. https://doi.org/10.1016/j.jnca.2009.12.005

Lowe, D. (2004). Distinctive image features from scale-invariant key points. International Journal of Computer Vision 60, 91–110 (2004). https://doi.org/10.1023/B:VISI.0000029664.99615.94.

Jayaraman, U., Surya, P., & Phalguni, G. (2013). Use of geometric features of principal components for indexing a biometric database. Mathematical aand Computing Modelling, 58, 147-164. https://doi.org/10.1016/j.mcm.2012.06.005

Li, F., Yi, K., & Le, W. (2010). Top-k queries on temporal data. VLDB, 19(5), 715-733. http://dx.doi.org/10.1007/s00778-010-0186-6

Sandu P, I., Zeitouni, K., Oria, V., Barth , D., & Vial, S. (2011). Indexing in network trajectory flows. VLDBJ, 20(5), 643-669. https://doi.org/10.1007/s00778-011-0236-8

Dittrichter. (2011). MOVIES: Indexing moving objects by shooting index images. Geoinformatics, 15(4), 727-767. http://dx.doi.org/10.1007/s10707-011-0122-y

Xie, M., Wang, H., Yin, J., & Meng, X. (2007). Integrity auditing of outsourced data. In Proceedings of the International Conference on Very Large Databases, 782-793.

Vandana, D. K., Jayaraman, U., Amit, K., Aman, K. G., & Gupta, P. (2013). An efficient indexing scheme for face database using modified geometric hashing. Neurocomputing, 116, 208-221. https://doi.org/10.1016/j.neucom.2011.12.056

Umarani, J., Surya, P., & Phalguni, G. (2013). Use of geimetric features of principal components for indexing a biometric database. Mathematical and computing modelling, 58, 147-164. https://doi.org/10.1016/j.mcm.2012.06.005

Ling-Yin, Y.-T. H.-C.-C. (2013). Indexing spatial data in cloud data management. Pervasive mobile computing, 1-14. https://doi.org/10.1016/j.pmcj.2013.07.001

Xiaohui, Yu, K. Q., & Nick, K. (2005). Monitoring K-nearest neighbour queries over moving objects. In Proceedings of the 21st International Conference on Data Engineering. https://doi.org/10.1109/ICDE.2005.92

Wang, M., Viliam, H., John, M., & Patrick, O. (2013, 2). High volumes of event stream indexing and efficient multi-keyword searching for cloud monitoring. Future generation computer, 29, 1943-1962. https://doi.org/10.1016/j.future.2013.04.028

Wang, J., Wu, S., Gao, H., Li, J., & Ooi, B. C. (2010). Indexing Multidimensional Data in a Cloud System. ACM SIGMOD International conference on management of data, 591-602. https://doi.org/10.1145/1807167.1807232

Spek, P. V., & Steven, K. (2011). Applying a dynamic threshold to improve cluster detection LSI. Science of computer programming, 76, 1261-1274. https://doi.org/10.1016/j.scico.2010.12.004

James, C., Yiping, K., Ada, W.-C. F., & Jeffrey, X. Y. (2011). Fast grapbh query processing with low-cost index. The VLDB journal, 20(4), 521-539. https://doi.org/10.1007/s00778-010-0212-8

Giangreco, I., Kabary, I. A., & Schuldt, H. (2014, 06). Adam - a database and information retrieval system for big multimedia collections. IEEE International Conference on, 406-413. https://doi.org/10.1109/BigData.Congress.2014.66

Collins, E. (2014). Intersection of the cloud and big data. IEE Cloud Computing 1, 84-85. http://dx.doi.org/10.1109/MCC.2014.12

Cackett, D. (2013). Information Management and Big data: A reference Architecture. Oracle corperation.

Wook-shin, H., Jinsoo, L., & Minh-Duc, P. (2010). iGraph: A framework for comparing a disk-based grapph indexing techniques. Proceedings of the VCLD endowment, 3(1).

Jin Z, L. C., Lin, Y., & Cai, D. (2014, august). Density Sensitive Hashing. IEEE transactions on Cybernetics, 44(8), 1362-1371. https://doi.org/10.1109/TCYB.2013.2283497

Ferragina, P., & Rossano, V. (2007, 7). The ompressed permuterm index. In proceedings of SIGIR, 535-542. https://doi.org/10.1145/1868237.1868248

JJinbao, W., Wu, S., Gao, J., Li, J., & Ooi, C. B. (2010). Indexing multi-dimensional data in a cloud system. SIGMOD. https://doi.org/10.1145/1807167.1807232


Refbacks

  • There are currently no refbacks.


Journal of Computer Science and Engineering (JCSE)
ISSN 2721-0251 (online)
Published by : ICSE (Institute of Computer Sciences and Engineering)
Website : http://icsejournal.com/index.php/JCSE/
Email: jcse@icsejournal.com

Creative Commons License is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.