Big Data? Hadoop might be your thing

Hadoop

Software as a Service (SaaS) can become a sales and marketing team's biggest headache if the business plan behind it is flawed. The SaaS model itself deserves little of the blame: it was meant to be delivered by vendors who could build a proper enterprise-level solution and support it throughout implementation. It doesn't always work out that way, but some solutions hold up better than others.

Hadoop, like many other big data solutions, has been a game changer, but unlike most of them it is also remarkably accessible. Its benefits include:

Computing Power: Its distributed computing model quickly processes big data. The more computing nodes you use, the more processing power you have.

Flexibility: Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.

Fault tolerance: Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes so the distributed computation does not fail, and multiple copies of all data are stored automatically (see the sketch after this list).

Low cost: The open-source framework is free and uses commodity hardware to store large quantities of data.

Scalability: You can easily grow your system simply by adding more nodes. Little administration is required.
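To make the fault-tolerance point concrete, here is a minimal sketch that uses the HDFS FileSystem API to report how many replicas a file has. It assumes a running HDFS cluster whose configuration (core-site.xml, hdfs-site.xml) is on the classpath, and the file path is purely hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and related settings from the Hadoop config files on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path, used only for illustration
        Path file = new Path("/user/hadoop/input/data.txt");
        FileStatus status = fs.getFileStatus(file);

        // HDFS keeps this many copies of each block, so losing a node does not lose data
        System.out.println(file + " is stored with replication factor " + status.getReplication());
    }
}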

Most of us have complained at some point about how poorly enterprise Java applications cope with large volumes of data. Code repositories and libraries try to address that, but there is only so much they can do. If you want to try Hadoop in standalone mode, you can now get it running in a few relatively easy steps.

VirtualBox

You will need a Unix flavor of your choice to get started. I am not in favor of ditching your base OS, especially if you are using Windows; even if you are not, I still recommend running a virtual machine with a Unix flavor. Ubuntu is an excellent choice, but Red Hat also works well.

You can follow the step-by-step installation guide over here. Just make sure you allocate enough RAM to the VM, as Hadoop is not a lightweight piece of software to configure.

MapReduce

Think of MapReduce as splitting your data into chunks so it can be processed in parallel. There are two sets of operations: map tasks, which process the input chunks, and reduce tasks, which aggregate the map output. Once the reduce phase is complete, the results are written back to the file system. Simple and effective.

Typically the compute nodes and the storage nodes are the same, which some cite as a criticism but which also lets the framework schedule work close to the data. The framework is split between a single master JobTracker and one slave TaskTracker per cluster node. Since your data is stored on those nodes, the JobTracker schedules the tasks and the TaskTrackers execute them.

Take a look at the source code below for an example; it is the classic WordCount sample from the Apache Hadoop documentation.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {

    // Mapper: emits (word, 1) for every whitespace-separated token in each input line
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sums the counts emitted for each word
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    // Driver: wires the job together using the classic org.apache.hadoop.mapred API
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

Output

$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000 
Bye 1 
Goodbye 1 
Hadoop 2 
Hello 2 
World 2

(output reproduced from the Apache Hadoop documentation)

You can also write your MapReduce jobs on the platform and in the language of your choice, such as Python or Go. If you are adopting the framework, though, make a point of checking how well it supports your chosen language, especially if that language's scripting interface is in flux, as has happened with the likes of Angular and PHP.

Getting used to the ecosystem

For most users this is the tricky bit, even after getting through the setup commands on the VM. You can follow any walkthrough for setting up the standalone instance, but which components you should learn next is still up for debate. My recommendation is to start with Hive: its tables and query language are very similar to SQL, with additional functions for data summarization.
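To give a feel for how close Hive is to ordinary SQL, here is a minimal sketch of querying it from Java over JDBC. It assumes a HiveServer2 instance listening on localhost:10000, the Hive JDBC driver on the classpath, and a hypothetical wordcounts table; adjust the connection details and names for your own setup.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (must be on the classpath)
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Connect to a local HiveServer2 instance; replace host, port and credentials as needed
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement()) {

            // A simple aggregation over a hypothetical table, written just as you would in plain SQL
            ResultSet rs = stmt.executeQuery(
                "SELECT word, COUNT(*) AS cnt FROM wordcounts GROUP BY word");

            while (rs.next()) {
                System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}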

You will also need Sqoop if you plan on feeding multiple data sources, or slices of data, into your BI solution. If you are working in e-commerce, for example, your data typically lives in several places, and I know for a fact that porting data out of Oracle or MySQL databases is not an easy task. Oracle databases are particularly difficult to port for most solutions, but components like Sqoop make the task a little simpler.
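Sqoop is normally driven from the command line, so one pragmatic way to fold it into a Java workflow is simply to launch the CLI from your code. The sketch below is heavy on assumptions: it presumes the sqoop binary is on the PATH, and the MySQL host, database, credentials, table, and target directory are all hypothetical placeholders for your own values.

import java.io.IOException;

public class SqoopImportLauncher {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Shells out to the Sqoop CLI; assumes `sqoop` is installed and on the PATH
        ProcessBuilder pb = new ProcessBuilder(
                "sqoop", "import",
                "--connect", "jdbc:mysql://dbhost/shopdb",   // hypothetical source database
                "--username", "report_user",                  // hypothetical credentials
                "--password", "secret",
                "--table", "orders",                          // hypothetical table to pull into HDFS
                "--target-dir", "/user/hadoop/orders",        // where the imported files land in HDFS
                "-m", "1");                                   // a single mapper keeps the example simple
        pb.inheritIO();
        int exitCode = pb.start().waitFor();
        System.out.println("sqoop import finished with exit code " + exitCode);
    }
}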

HBase, Ambari, Solr, etc. are also names you should keep in mind. Of course, you will have your favorites, but go through as many of the components as possible if you want to get the most out of Hadoop. Happy hunting!