BigBench in the Hadoop Ecosystem

The first proof-of-concept implementation was built for the Teradata Aster parallel database system, and the queries were formulated in the proprietary SQL-MR query language. To test other systems, the queries have to be translated. In this paper, an alternative implementation of BigBench for the Hadoop ecosystem is presented. All 30 queries of BigBench were realized with Apache Hive, Apache Hadoop, Apache Mahout, and NLTK. We present the different design choices we took and show a performance evaluation.

1 Introduction

Big data analytics is an ever-growing field of research and business.

Due to the drastic decrease in the cost of storage and computation, more and more data sources become profitable for data mining. A perfect example are online stores: while earlier online shopping systems would only record successful transactions, modern systems record every single interaction of a user with the website. The former allowed for simple basket analysis techniques, while the current level of detail in monitoring makes detailed user modeling possible. The growing demands on data management systems and the new forms of analysis have led to the development of a new breed of systems, big data management systems (BDMS).

Similar to the advent of database management systems, there is a vastly growing ecosystem of diverse approaches. This leads to a dilemma for customers of BDMSs, since there are no realistic and proven measures to compare different offerings. To this end, we have developed BigBench, the first proposal for an end-to-end big data analytics benchmark [1]. BigBench was designed to cover essential functional and business aspects of big data use cases. In this paper, we present an alternative implementation of the BigBench workload for the Hadoop ecosystem.

We re-implemented all 30 queries and ran proof-of-concept experiments on a 1 GB BigBench installation. The rest of the paper is organized as follows. In Section 2, we present an overview of the BigBench benchmark. Section 3 introduces the parts of the Hadoop ecosystem that were used in our implementation. We give details on the transformation and implementation of the workload in Section 4. We present a proof-of-concept evaluation of our implementation in Section 5. Section 6 gives an overview of related work. We conclude with future work in Section 7.

2 BigBench Overview

Fig. 1. BigBench Schema (structured data: Item, Marketplace, Sales, Web Page, Customer; semi-structured data: Web Log; unstructured data: Reviews; parts adapted from TPC-DS, parts BigBench-specific)

BigBench is an end-to-end big data analytics benchmark; it was built to resemble modern analytic use cases in retail business. As basis for the benchmark, the Transaction Processing Performance Council's (TPC) new decision support benchmark TPC-DS was chosen [2]. This choice highly sped up the development of BigBench and made it possible to start from a solid and proven foundation.

A high-level overview of the data model can be seen in Figure 1. The TPC-DS data model is a snowflake schema with 6 fact tables, representing 3 sales channels (store sales, catalog sales, and online sales), each with a sales and a returns fact table. For BigBench the catalog sales were removed, since they have decreasing significance in retail business. As can be seen in Figure 1, additional big data specific dimensions were added. Marketplace is a traditional relational table storing competitors' prices. The Web Log portion represents a click-stream that is used to analyze user behavior.

This part of the data set is semi-structured, since different entries in the web log represent different user actions and thus have different formats. The log is generated in the form of an Apache Web server log. The unstructured part of the schema is generated in the form of product reviews. These are, for example, used for sentiment analysis. The full schema is described in [3]. BigBench features 30 complex queries, 10 of which are taken from TPC-DS. The queries cover the major areas of big data analytics [4]. As a result, they cannot all be expressed as pure SQL queries.
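To make the semi-structured nature of the web log concrete, the following minimal Python sketch parses one entry in the Apache combined log format with a regular expression. The sample line and field names are illustrative assumptions, not output of the BigBench data generator:

```python
import re

# Hedged sketch: one plausible pattern for an Apache-style access log line.
# Different user actions produce different methods/paths, which is what makes
# the click-stream as a whole semi-structured.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

# Illustrative sample entry (hypothetical, not BigBench data).
line = '10.0.0.7 - alice [10/Oct/2013:13:55:36 -0700] "GET /item/42 HTTP/1.1" 200 2326'
entry = LOG_PATTERN.match(line).groupdict()
# entry["method"] is "GET", entry["path"] is "/item/42"
```

A real workload would feed lines like this into Hive after such a parsing step, but the exact extraction logic depends on the log fields a query needs.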

In Teradata Aster, this is solved using built-in functions that are internally processed in a MapReduce fashion. The benchmark, however, does not dictate a specific implementation. The full list of queries can be found in [3].

3 Technologies for BigBench on Hadoop

Fig. 2. Hadoop Stack (Mahout, Hive, MapReduce, ZooKeeper, HBase, HDFS)

In this section, the technologies used to create an open-source implementation of BigBench are described. BigBench is mainly implemented using four open-source software frameworks: Apache Hadoop, Apache Hive, Apache Mahout, and the Natural Language Processing Toolkit (NLTK).

We used the following versions for our implementation: Apache Hadoop 0.20.2, Apache Hive 0.8.1, Apache Mahout 0.6, and NLTK 3.0.

3.1 Hadoop

Apache Hadoop provides a scalable distributed file system as well as features to store and analyze large data sets using the MapReduce framework [5]. Its architecture consists of many components; a discussion of the design decisions and implementation details can be found in [6]. Only the components most relevant to the BigBench implementation are described in the following.
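To illustrate the MapReduce programming model that Hadoop implements, here is a minimal, self-contained Python sketch of the classic word-count example. It is a toy in-process model under simplifying assumptions (no distribution, no partitioning), not code from the benchmark implementation:

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Map: emit an intermediate (word, 1) pair for every word in every line.
    for line in records:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    # Shuffle: group intermediate pairs by key (the word).
    pairs = sorted(pairs, key=itemgetter(0))
    # Reduce: sum the counts within each group.
    return {word: sum(count for _, count in group)
            for word, group in groupby(pairs, key=itemgetter(0))}

# Illustrative input; in Hadoop the records would come from HDFS splits.
lines = ["big data analytics", "big data systems"]
counts = reduce_phase(map_phase(lines))
# counts == {"analytics": 1, "big": 2, "data": 2, "systems": 1}
```

Hive compiles its SQL-like queries into chains of exactly such map and reduce phases, which is what makes it a natural target for the BigBench queries.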

The Hadoop Distributed File System (HDFS) is modeled after the Unix file system hierarchy, with 3-way replication of data for reliability and analysis performance purposes. The Hadoop command line interface provides access to most standard Unix file operations such as ls, rm, cp, etc. A complete reference can be found on Apache Hadoop's website: http://hadoop.apache.org/. A cluster implementing Hadoop has 3 main components: the HDFS client, the namenode, and the datanodes. The namenode primarily stores metadata. It keeps a record of the namespace tree, which stores information relevant to the allocation of file blocks to datanodes.

It should be noted that all of the namespace data is stored in RAM. There can only be one namenode in any single cluster in the version of Hadoop used; there are, however, usually multiple datanodes in a cluster. Each datanode stores every block replica as two files in its local file system: one holding the block's metadata and the other the actual data. The HDFS client provides an interface for user-created applications to access and modify HDFS. Access is provided in a two-tiered process: first, the metadata in the namenode is retrieved, and then this information is used to access the relevant datanodes.