Big Data Indexing: Taxonomy, Performance Evaluation, Challenges and Research Opportunities

Abubakar Usman Othman, Timothy Moses, Umar Yahaya Aisha, Abdulsalam Ya’u Gital, Boukari Souley, Badmos Tajudeen Adeleke


Indexing methods are widely used on big data to retrieve information efficiently from very large, complex datasets held in distributed cloud storage. Big data has grown quickly because of widespread internet access, mobile devices such as smartphones and tablets, body-sensor devices, and cloud applications. This growth, seen in healthcare, manufacturing, the sciences, commerce, social networks, and agriculture, raises a variety of problems for big data indexing. Because of their high storage and processing requirements, current indexing approaches fall short of the needs of big data in cloud computing, so an effective indexing strategy is necessary. This paper presents the state-of-the-art indexing techniques currently proposed for big data, identifies the problems that these techniques and big data face, and outlines future directions for research on big data indexing in cloud computing. It also compares the performance of the techniques in its taxonomy using mean average precision and precision-recall rate.


Indexing; Similarity search; Matching; Big data; Cloud computing






Journal of Computer Science and Engineering (JCSE)
ISSN 2721-0251 (online)
Published by: ICSE (Institute of Computer Sciences and Engineering)

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.