Wednesday, March 26, 2014

CCD-410 Questions

Your cluster’s HDFS block size is 64MB. You have a directory containing 100 plain text files, each of which is 100MB in size. The InputFormat for your job is TextInputFormat. Determine how many Mappers will run?
A. 64
B. 100
C. 200
D. 640
Answer: C

Explanation:
Each 100MB file spans two 64MB blocks, so TextInputFormat creates two input splits per file: 100 files x 2 splits = 200 map tasks.

Can you use MapReduce to perform a relational join on two large tables sharing a key?
Assume that the two tables are formatted as comma-separated files in HDFS.

A. Yes.
B. Yes, but only if one of the tables fits into memory
C. Yes, so long as both tables fit into memory.
D. No, MapReduce cannot perform relational operations.
E. No, but it can be done with either Pig or Hive.
Answer: A

Which process describes the lifecycle of a Mapper?

A. The JobTracker calls the TaskTracker’s configure () method, then its map () method and finally
its close () method.
B. The TaskTracker spawns a new Mapper to process all records in a single input split.
C. The TaskTracker spawns a new Mapper to process each key-value pair.
D. The JobTracker spawns a new Mapper to process all records in a single file.
Answer: B
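For reference, a minimal Mapper skeleton (a sketch assuming the newer org.apache.hadoop.mapreduce API; the class and types are chosen only for illustration) shows the lifecycle: the framework creates one Mapper per input split, calls setup() once, calls map() once for every record in that split, and calls cleanup() once at the end.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LifecycleMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

  @Override
  protected void setup(Context context) {
    // Runs once per task, before any records of the split are processed.
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Runs once for every record in this task's input split.
    context.write(line, offset);
  }

  @Override
  protected void cleanup(Context context) {
    // Runs once per task, after the last record of the split.
  }
}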

In a MapReduce job with 500 map tasks, how many map task attempts will there be?

A. It depends on the number of reduces in the job.
B. Between 500 and 1000.
C. At most 500.
D. At least 500.
E. Exactly 500.
Answer: D

MapReduce v2 (MRv2 /YARN) splits which major functions of the JobTracker into separate
daemons? Select two.

A. Health status checks (heartbeats)
B. Resource management
C. Job scheduling/monitoring
D. Job coordination between the ResourceManager and NodeManager
E. Launching tasks
F. Managing file system metadata
G. MapReduce metric reporting
H. Managing tasks
Answer: B, C

When is the earliest point at which the reduce method of a given Reducer can be called?

A. As soon as at least one mapper has finished processing its input split.
B. As soon as a mapper has emitted at least one record.
C. Not until all mappers have finished processing all records.
D. It depends on the InputFormat used for the job.
Answer: C

In a large MapReduce job with m mappers and n reducers, how many distinct copy operations will there be in the sort/shuffle phase?

A. m * n (i.e., m multiplied by n)
B. n
C. m
D. m+n (i.e., m plus n)
E. m^n (i.e., m to the power of n)
Answer: A

For each intermediate key, each reducer task can emit:

A. As many final key-value pairs as desired. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
B. As many final key-value pairs as desired, but they must have the same type as the intermediate
key-value pairs.
C. As many final key-value pairs as desired, as long as all the keys have the same type and all the
values have the same type.
D. One final key-value pair per value associated with the key; no restrictions on the type.
E. One final key-value pair per key; no restrictions on the type.
Answer: C

You need to move a file titled “weblogs” into HDFS. When you try to copy the file, you can’t.
You know you have ample space on your DataNodes. Which action should you take to relieve this situation and store more files in HDFS?

A. Increase the block size on all current files in HDFS.
B. Increase the block size on your remaining files.
C. Decrease the block size on your remaining files.
D. Increase the amount of memory for the NameNode.
E. Increase the number of disks (or size) for the NameNode.
F. Decrease the block size on all current files in HDFS.
Answer: C

Identify which best defines a SequenceFile?

A. A SequenceFile contains a binary encoding of an arbitrary number of homogeneous Writable objects
B. A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous Writable objects
C. A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.
D. A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.
Answer: D
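For illustration, a small sketch of writing and then reading a SequenceFile with one key type and one value type, using the FileSystem-based SequenceFile API (the path and the sample records are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/demo.seq");   // hypothetical path

    // Write: every key is Text and every value is IntWritable (one key type, one value type).
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
    writer.append(new Text("hadoop"), new IntWritable(1));
    writer.append(new Text("hdfs"), new IntWritable(2));
    writer.close();

    // Read the key-value pairs back.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    Text key = new Text();
    IntWritable value = new IntWritable();
    while (reader.next(key, value)) {
      System.out.println(key + "\t" + value);
    }
    reader.close();
  }
}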


Which describes how a client reads a file from HDFS?

A. The client queries the NameNode for the block location(s). The NameNode returns the block location(s) to the client. The client reads the data directly off the DataNode(s).
B. The client queries all DataNodes in parallel. The DataNode that contains the requested data responds directly to the client. The client reads the data directly off the DataNode.
C. The client contacts the NameNode for the block location(s). The NameNode then queries the DataNodes for block locations. The DataNodes respond to the NameNode, and the NameNode redirects the client to the DataNode that holds the requested data block(s). The client then reads the data directly off the DataNode.
D. The client contacts the NameNode for the block location(s). The NameNode contacts the DataNode that holds the requested data block. Data is transferred from the DataNode to the NameNode, and then from the NameNode to the client.
Answer: A


You are developing a combiner that takes as input Text keys, IntWritable values, and emits Text keys, IntWritable values. Which interface should your class implement? 

A. Combiner <Text, IntWritable, Text, IntWritable>
B. Mapper <Text, IntWritable, Text, IntWritable>
C. Reducer <Text, Text, IntWritable, IntWritable>
D. Reducer <Text, IntWritable, Text, IntWritable>
E. Combiner <Text, Text, IntWritable, IntWritable>

Answer: D 
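A combiner is simply a Reducer implementation that is wired in with setCombinerClass(). A sketch using the older org.apache.hadoop.mapred API that this question assumes (the class name is made up for illustration):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumCombiner extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

// Wired into the job with: conf.setCombinerClass(SumCombiner.class);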

Identify the utility that allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer?
A. Oozie 
B. Sqoop 
C. Flume 
D. Hadoop Streaming 
E. mapred 
Answer: D 

Explanation: 
Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. 


How are keys and values presented and passed to the reducers during a standard sort and shuffle phase of MapReduce? 

A. Keys are presented to reducer in sorted order; values for a given key are not sorted. 
B. Keys are presented to reducer in sorted order; values for a given key are sorted in ascending order. 
C. Keys are presented to a reducer in random order; values for a given key are not sorted. 
D. Keys are presented to a reducer in random order; values for a given key are sorted in ascending order. 
Answer: A 

Explanation: 
Reducer has 3 primary phases: 

1.Shuffle 

The Reducer copies the sorted output from each Mapper using HTTP across the network. 

2.Sort 

The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key). 
The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged. 

SecondarySort 

To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce. 

3.Reduce 

In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> pair in the sorted inputs.
The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object). The output of the Reducer is not re-sorted. 

Assuming default settings, which best describes the order of data provided to a reducer’s reduce method:
A. The keys given to a reducer aren’t in a predictable order, but the values associated with those keys always are. 
B. Both the keys and values passed to a reducer always appear in sorted order. 
C. Neither keys nor values are in any predictable order. 
D. The keys given to a reducer are in sorted order but the values associated with each key are in no predictable order 
Answer: D 

Explanation:
The framework merge-sorts the map outputs by key before invoking reduce, so keys arrive at the reducer in sorted order. The values associated with each key, however, are in no predictable order unless the job implements a secondary sort (see the Shuffle, Sort and Reduce phases described above).


You wrote a map function that throws a runtime exception when it encounters a control character in input data. The input supplied to your mapper contains twelve such characters in total, spread across five file splits. The first four file splits each have two control characters and the last split has four control characters. Identify the number of failed task attempts you can expect when you run the job with
mapred.max.map.attempts set to 4: 
A. You will have forty-eight failed task attempts 
B. You will have seventeen failed task attempts 
C. You will have five failed task attempts 
D. You will have twelve failed task attempts 
E. You will have twenty failed task attempts 
Answer: E 
Explanation: 
There will be four failed task attempts for each of the five file splits: 5 splits x 4 attempts = 20 failed task attempts.

You want to populate an associative array in order to perform a map-side join. You’ve decided to put this information in a text file, place that file into the DistributedCache and read it in your Mapper before any records are processed. Identify which method in the Mapper you should use to implement code for reading the file and populating the associative array?


A. combine
B. map
C. init
D. configure
Answer: D 

Explanation: 
See 3) below. Here is an illustrative example on how to use the DistributedCache: 
// Setting up the cache for the application 
1. Copy the requisite files to the FileSystem: 
$ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat 
$ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip 
$ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar 
$ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar 
$ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz 
$ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz 

2. Setup the application's JobConf: 

JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);
DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);

3. Use the cached files in the Mapper or Reducer:

public static class MapClass<K, V> extends MapReduceBase
    implements Mapper<K, V, K, V> {

  private Path[] localArchives;
  private Path[] localFiles;

  public void configure(JobConf job) {
    // Get the locally cached archives/files
    localArchives = DistributedCache.getLocalCacheArchives(job);
    localFiles = DistributedCache.getLocalCacheFiles(job);
  }

  public void map(K key, V value, OutputCollector<K, V> output, Reporter reporter)
      throws IOException {
    // Use data from the cached archives/files here
    // ...
    // output.collect(k, v);
  }
}

You’ve written a MapReduce job that will process 500 million input records and generate 500 million key-value pairs. The data is not uniformly distributed. Your MapReduce job will create a significant amount of intermediate data that it needs to transfer between mappers and reducers, which is a potential bottleneck. A custom implementation of which interface is most likely to reduce the amount of intermediate data transferred across the network?

  
A. Partitioner 
B. OutputFormat 
C. WritableComparable 
D. Writable 
E. InputFormat 
F. Combiner 

Answer: F 

Explanation: 
Combiners are used to increase the efficiency of a MapReduce program. They aggregate intermediate map output locally on each mapper node. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative.
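As an illustration (a sketch using the newer org.apache.hadoop.mapreduce API; the class names are chosen for the example), the classic word count reuses its sum reducer as the combiner, which is safe because addition is commutative and associative:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private Text word = new Text();

    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Sums counts; because addition is commutative and associative, the same class
  // can safely be used as both the combiner and the reducer.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountWithCombiner.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}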

Can you use MapReduce to perform a relational join on two large tables sharing a key? Assume that the two tables are formatted as comma-separated files in HDFS. 
A. Yes. 
B. Yes, but only if one of the tables fits into memory 
C. Yes, so long as both tables fit into memory. 
D. No, MapReduce cannot perform relational operations. 
E. No, but it can be done with either Pig or Hive. 
Answer: A 

Explanation:
Join algorithms in MapReduce:
  • Reduce-side join
  • Map-side join
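Below is a compact reduce-side join sketch (newer org.apache.hadoop.mapreduce API, assuming a Hadoop 2.x release where MultipleInputs is available for that API). The input paths and the customers/orders column layouts are hypothetical; each mapper tags its records so the reducer can tell the two tables apart, and one side is buffered in memory per key, which is fine as long as no single key has an enormous number of matching records.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceSideJoin {

  // Tags each customer record with "C" and emits it keyed on the join key.
  public static class CustomerMapper extends Mapper<Object, Text, Text, Text> {
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] f = value.toString().split(",");     // e.g. custId,name
      if (f.length < 2) return;                     // skip malformed lines
      ctx.write(new Text(f[0]), new Text("C," + f[1]));
    }
  }

  // Tags each order record with "O" and emits it keyed on the join key.
  public static class OrderMapper extends Mapper<Object, Text, Text, Text> {
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] f = value.toString().split(",");     // e.g. custId,amount
      if (f.length < 2) return;
      ctx.write(new Text(f[0]), new Text("O," + f[1]));
    }
  }

  // All records sharing a join key meet in one reduce() call, where they are combined.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      List<String> customers = new ArrayList<String>();
      List<String> orders = new ArrayList<String>();
      for (Text v : values) {
        String s = v.toString();
        if (s.startsWith("C,")) customers.add(s.substring(2));
        else orders.add(s.substring(2));
      }
      for (String c : customers) {
        for (String o : orders) {
          ctx.write(key, new Text(c + "," + o));
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "reduce-side join");
    job.setJarByClass(ReduceSideJoin.class);
    MultipleInputs.addInputPath(job, new Path("/data/customers"), TextInputFormat.class, CustomerMapper.class);
    MultipleInputs.addInputPath(job, new Path("/data/orders"), TextInputFormat.class, OrderMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path("/data/joined"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}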




You have just executed a MapReduce job. Where is intermediate data written to after being emitted from the Mapper's map method?


A. Intermediate data is streamed across the network from the Mapper to the Reducer and is never written to disk.

B. Into in-memory buffers on the TaskTracker node running the Mapper that spill over and are written into HDFS.
C. Into in-memory buffers that spill over to the local file system of the TaskTracker node running the Mapper.
D. Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker node running the Reducer
E. Into in-memory buffers on the TaskTracker node running the Reducer that spill over and are written into HDFS. 


You want to understand more about how users browse your public website, such as which pages they visit prior to placing an order. You have a farm of 200 web servers hosting your website. How will you gather this data for your analysis?



A. Ingest the server web logs into HDFS using Flume.
B. Write a MapReduce job, with the web servers for mappers, and the Hadoop cluster nodes for reducers.
C. Import all users' clicks from your OLTP databases into Hadoop, using Sqoop.
D. Channel these clickstreams into Hadoop using Hadoop Streaming.
E. Sample the weblogs from the web servers, copying them into Hadoop using curl.



MapReduce v2 (MRv2/YARN) is designed to address which two issues?

A. Single point of failure in the NameNode.
B. Resource pressure on the JobTracker.
C. HDFS latency.
D. Ability to run frameworks other than MapReduce, such as MPI.
E. Reduce complexity of the MapReduce APIs.
F. Standardize on a single MapReduce API. 


You need to run the same job many times with minor variations. Rather than hardcoding all job configuration options in your driver code, you've decided to have your Driver subclass org.apache.hadoop.conf.Configured and implement the org.apache.hadoop.util.Tool interface. Identify which invocation correctly passes mapred.job.name with a value of Example to Hadoop?


A. hadoop 'mapred.job.name=Example' MyDriver input output
B. hadoop MyDriver mapred.job.name=Example input output
C. hadoop MyDriver -D mapred.job.name=Example input output
D. hadoop setproperty mapred.job.name=Example MyDriver input output
E. hadoop setproperty ('mapred.job.name=Example') MyDriver input output


You are developing a MapReduce job for sales reporting. The mapper will process input keys representing the year (IntWritable) and input values representing product identifiers (Text). Identify what determines the data types used by the Mapper for a given job.




A. The key and value types specified in the JobConf.setMapInputKeyClass and JobConf.setMapInputValuesClass methods
B. The data types specified in the HADOOP_MAP_DATATYPES environment variable
C. The mapper-specification.xml file submitted with the job determines the mapper's input key and value types.
D. The InputFormat used by the job determines the mapper's input key and value types.


Identify the MapReduce v2 (MRv2 / YARN) daemon responsible for launching application containers and monitoring application resource usage?

A. ResourceManager
B. NodeManager
C. ApplicationMaster
D. ApplicationMasterService
E. TaskTracker
F. JobTracker 

Which best describes how TextInputFormat processes input files and line breaks?

A. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.
B. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of both splits containing the broken line.
C. The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.
D. Input file splits may cross line breaks. A line that crosses file splits is ignored.
E. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line. 


For each input key-value pair, mappers can emit:

A. As many intermediate key-value pairs as desired. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
B. As many intermediate key-value pairs as desired, but they cannot be of the same type as the input key-value pair.
C. One intermediate key-value pair, of a different type.
D. One intermediate key-value pair, but of the same type.
E. As many intermediate key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.

You have the following key-value pairs as output from your Map task:

(the, 1)
(fox, 1)
(faster, 1)
(than, 1)
(the, 1)
(dog, 1)
How many keys will be passed to the Reducer's reduce method?
A. Six
B. Five
C. Four
D. Two
E. One
F. Three 

You have user profile records in your OLTP database that you want to join with web logs you have already ingested into the Hadoop file system. How will you obtain these user records?

A. HDFS command
B. Pig LOAD command
C. Sqoop import
D. Hive LOAD DATA command
E. Ingest with Flume agents
F. Ingest with Hadoop Streaming 


Which two updates occur when a client application opens a stream to begin a file write on a cluster running MapReduce v1 (MRv1)?

A. Once the write stream closes on the DataNode, the DataNode immediately initiates a block report to the NameNode.

B. The change is written to the NameNode disk.
C. The metadata in the RAM on the NameNode is flushed to disk.
D. The metadata in RAM on the NameNode is flushed to disk.
E. The metadata in RAM on the NameNode is updated.
F. The change is written to the edits file. 

Answer: E, F


For a MapReduce job, on a cluster running MapReduce v1 (MRv1), what's the relationship between tasks and task attempts?


A. There are always at least as many task attempts as there are tasks.

B. There are always at most as many task attempts as there are tasks.
C. There are always exactly as many task attempts as there are tasks.
D. The developer sets the number of task attempts on job submission. 

What action occurs automatically on a cluster when a DataNode is marked as dead?

A. The NameNode forces re-replication of all the blocks which were stored on the dead DataNode.
B. The next time a client submits a job that requires blocks from the dead DataNode, the JobTracker receives no heartbeats from the DataNode. The JobTracker tells the NameNode that the DataNode is dead, which triggers block re-replication on the cluster.
C. The replication factor of the files which had blocks stored on the dead DataNode is temporarily reduced, until the dead DataNode is recovered and returned to the cluster.
D. The NameNode informs the client which wrote the blocks that are no longer available; the client then re-writes the blocks to a different DataNode.

How does the NameNode know DataNodes are available on a cluster running MapReduce v1 (MRv1)?

A. DataNodes are listed in the dfs.hosts file. The NameNode uses this as the definitive list of available DataNodes.
B. DataNodes heartbeat to the master on a regular basis.
C. The NameNode broadcasts a heartbeat on the network on a regular basis, and DataNodes respond.
D. The NameNode sends a broadcast across the network when it first starts, and DataNodes respond.

Which three distcp features can you utilize on a Hadoop cluster?

A. Use distcp to copy files only between two clusters or more. You cannot use distcp to copy data between directories inside the same cluster.
B. Use distcp to copy HBase table files.
C. Use distcp to copy physical blocks from the source to the target destination in your cluster.
D. Use distcp to copy data between directories inside the same cluster.
E. Use distcp to run an internal MapReduce job to copy files. 

How does HDFS Federation help HDFS Scale horizontally?

A. HDFS Federation improves the resiliency of HDFS in the face of network issues by removing the NameNode as a single-point-of-failure.
B. HDFS Federation allows the Standby NameNode to automatically resume the services of an active NameNode.
C. HDFS Federation provides cross-data center (non-local) support for HDFS, allowing a cluster administrator to split the Block Storage outside the local cluster. 
D. HDFS Federation reduces the load on any single NameNode by using multiple, independent NameNodes to manage individual parts of the filesystem namespace.

Choose which best describes a Hadoop cluster's block size storage parameters once you set the HDFS default block size to 64MB?

A. The block size of files in the cluster can be determined as the block is written.
B. The block size of files in the cluster will all be multiples of 64MB.
C. The block size of files in the cluster will all be at least 64MB.
D. The block size of files in the cluster will all be the exactly 64MB. 

Which MapReduce daemon instantiates user code, and executes map and reduce tasks on a cluster running MapReduce v1 (MRv1)?

A. NameNode
B. DataNode
C. JobTracker
D. TaskTracker
E. ResourceManager
F. ApplicationMaster
G. NodeManager 

Which two actions must you take if you are running a Hadoop cluster with a single NameNode and six DataNodes, and you want to change a configuration parameter so that it affects all six DataNodes?

A. You must restart the NameNode daemon to apply the changes to the cluster
B. You must restart all six DataNode daemons to apply the changes to the cluster.
C. You don't need to restart any daemon, as they will pick up changes automatically.
D. You must modify the configuration files on each of the six DataNode machines.
E. You must modify the configuration files on only one of the DataNode machines.
F. You must modify the configuration files on the NameNode only. DataNodes read their configuration from the master nodes. 

Identify the function performed by the Secondary NameNode daemon on a cluster configured to run with a single NameNode.

A. In this configuration, the Secondary NameNode performs a checkpoint operation on the files used by the NameNode.
B. In this configuration, the Secondary NameNode is a standby NameNode, ready to fail over and provide high availability.
C. In this configuration, the Secondary NameNode performs real-time backups of the NameNode.
D. In this configuration, the Secondary NameNode serves as an alternate data channel for clients to reach HDFS, should the NameNode become too busy.

You install Cloudera Manager on a cluster where each host has 1 GB of RAM. All of the services show their status as concerning. However, all jobs submitted complete without an error. Why is Cloudera Manager showing the concerning status for the services?

A. A slave node's disk ran out of space
B. The slave nodes haven't sent a heartbeat in 60 minutes.
C. The slave nodes are swapping.
D. DataNode service instance has crashed. 

What is the recommended disk configuration for slave nodes in your Hadoop cluster with 6 x 2 TB hard drives?

A. RAID 10
B. JBOD
C. RAID 5
D. RAID 1+0 

You configure your cluster with HDFS High Availability (HA) using Quorum-based storage. You do not implement HDFS Federation.

What is the maximum number of NameNode daemons you should run on your cluster in order to avoid a 'split-brain' scenario with your NameNodes?
A. Unlimited. HDFS High Availability (HA) is designed to overcome limitations on the number of
NameNodes you can deploy.
B. Two active NameNodes and one Standby NameNode
C. One active NameNode and one Standby NameNode
D. Two active NameNodes and two Standby NameNodes 

You configure a Hadoop cluster with both MapReduce frameworks, MapReduce v1 (MRv1) and MapReduce v2 (MRv2/YARN). Which two MapReduce (computational) daemons do you need to configure to run on your master nodes?

A. JobTracker
B. ResourceManager
C. ApplicationMaster
D. JournalNode
E. NodeManager 


You observe that the number of spilled records from map tasks far exceeds the number of map output records. Your child heap size is 1 GB and your io.sort.mb value is set to 100MB. How would you tune your io.sort.mb value to achieve maximum memory to disk I/O ratio?

A. Tune io.sort.mb value until you observe that the number of spilled records equals (or is as close
to equals) the number of map output records.
B. Decrease the io.sort.mb value below 100MB.
C. Increase the io.sort.mb as high as you can, as close to 1GB as possible.
D. For a 1GB child heap size an io.sort.mb of 128MB will always maximize memory to disk I/O. 


Your Hadoop cluster has 25 nodes with a total of 100 TB (4 TB per node) of raw disk space allocated to HDFS storage. Assuming Hadoop's default configuration, how much data will you be able to store?

A. Approximately 100TB
B. Approximately 25TB
C. Approximately 10TB
D. Approximately 33 TB 

You set up the Hadoop cluster using NameNode Federation. One NameNode manages the /users namespace and one NameNode manages the /data namespace. What happens when a client tries to write a file to /reports/myreport.txt?
A. The file successfully writes to /users/reports/myreports/myreport.txt.
B. The client throws an exception.
C. The file successfully writes to /report/myreport.txt. The metadata for the file is managed by the first NameNode to which the client connects.
D. The file writes fails silently; no file is written, no error is reported. 

Identify two features/issues that MapReduce v2 (MRv2/YARN) is designed to address:

A. Resource pressure on the JobTracker.
B. HDFS latency.
C. Ability to run frameworks other than MapReduce, such as MPI.
D. Reduce complexity of the MapReduce APIs.
E. Single point of failure in the NameNode.
F. Standardize on a single MapReduce API. 

The most important consideration for slave nodes in a Hadoop cluster running production jobs that require short turnaround times is:

A. The ratio between the amount of memory and the number of disk drives.
B. The ratio between the amount of memory and the total storage capacity.
C. The ratio between the number of processor cores and the amount of memory.
D. The ratio between the number of processor cores and total storage capacity.
E. The ratio between the number of processor cores and number of disk drives. 


The failure of which daemon makes HDFS unavailable on a cluster running MapReduce v1(MRv1)?

A. Node Manager
B. Application Manager
C. Resource Manager
D. Secondary NameNode
E. NameNode
F. DataNode 



What is the difference between Hadoop and a relational database?

Hadoop is not a database; it is an architecture with a filesystem called HDFS. The data is stored in HDFS, which does not have any predefined containers.
 A relational database stores data in predefined containers.

What is HDFS?
HDFS stands for Hadoop Distributed File System. It is a framework involving many machines which stores large amounts of data in files across a Hadoop cluster.

What is MapReduce?

MapReduce is a programming framework used to access and manipulate large data sets over a Hadoop cluster.

What is an InputSplit in MapReduce?

An InputSplit is the slice of data to be processed by a single Mapper. It is generally the size of an HDFS block, which is stored on a DataNode.

What is the meaning of replication factor?

The replication factor defines the number of times a given data block is stored in the cluster. The default replication factor is 3. This also means that you need 3 times the amount of storage needed to store the data. Each file is split into data blocks and spread across the cluster.

What is the default replication factor in HDFS?

Hadoop's default replication factor is 3. You can set the replication level individually for each file in HDFS.

Most Hadoop administrators leave the default replication factor at three. The main assumption here is that if you keep three copies of the data, your data is safe, and this holds true in the big clusters that we manage and operate.

In addition to fault tolerance, having replicas allows jobs that consume the same data to be run in parallel. Also, if there are replicas of the data, Hadoop can attempt to run multiple copies of the same task and take whichever finishes first. This is useful if for some reason a machine is being slow.
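The per-file replication factor can also be changed programmatically; a minimal sketch using the FileSystem API (the path shown is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Override the cluster-wide default (dfs.replication, normally 3) for one file.
    fs.setReplication(new Path("/data/weblogs/part-00000"), (short) 2);
  }
}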

What is the typical block size of an HDFS block?

The default block size is 64MB, but 128MB is typical.

What is the NameNode?
The NameNode is one of the daemons that runs on the master node. It holds the metadata describing where each chunk of data (i.e., which DataNode) resides. Based on this metadata, it maps incoming jobs to the corresponding DataNodes.

How does the master-slave architecture work in Hadoop?

In total, five daemons run in the Hadoop master-slave architecture.
 On the master node: NameNode, JobTracker and Secondary NameNode
 On the slaves: DataNode and TaskTracker
 It is recommended to run the Secondary NameNode on a separate machine with master-node capacity.

What are compute and storage nodes?

Hadoop can be defined in two ways:
 Distributed processing: MapReduce
 Distributed storage: HDFS
 The NameNode holds the metadata, and the DataNodes hold the actual data and run the MapReduce tasks.

Explain the input and output data formats of the Hadoop framework.

FileInputFormat, TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat, SequenceFileAsTextInputFormat and WholeFileInputFormat are input formats in the Hadoop framework.

How can we control which reducer a particular key goes to?

By using a custom partitioner.
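For example, a sketch of a custom Partitioner (newer API; the routing rule and class name are made up for illustration) that sends keys starting with a-m to reducer 0 and all other keys to reducer 1:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0;
    }
    char first = Character.toLowerCase(key.toString().charAt(0));
    // Keys a-m go to partition 0, everything else to partition 1.
    return (first >= 'a' && first <= 'm' ? 0 : 1) % numPartitions;
  }
}

// Driver: job.setPartitionerClass(AlphabetPartitioner.class); job.setNumReduceTasks(2);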

What is the Reducer used for?

The Reducer is used to combine the multiple outputs of the mappers into one consolidated output.

What are the primary phases of the Reducer? 

Reducer has 3 primary phases: shuffle, sort and reduce.

What happens if the number of reducers is 0?

It is legal to set the number of reduce-tasks to zero if no reduction is desired.
 In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
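A minimal map-only driver sketch (newer API; the input and output paths are hypothetical) that disables the reduce phase:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "map-only job");
    job.setNumReduceTasks(0);   // no reduce phase: map output goes straight to HDFS, unsorted
    FileInputFormat.addInputPath(job, new Path("/data/in"));
    FileOutputFormat.setOutputPath(job, new Path("/data/out"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}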

How many instances of the JobTracker can run on a Hadoop cluster?

One. There can only be one JobTracker in the cluster. This can be run on the same machine running the NameNode.

How does the NameNode handle DataNode failures?

Through heartbeats and block reports: each DataNode periodically sends a heartbeat to the NameNode. If the NameNode stops receiving heartbeats from a DataNode, it marks the node as dead and re-replicates its blocks onto other DataNodes. (Separately, checksums are used to detect corruption: every data record is followed by a checksum, and if the checksum does not match the original, a data-corrupted error is reported.)

Can I set the number of reducers to zero?

Yes, the number of reducers can be set to zero. In that case the mapper output is the final output and is stored directly in HDFS.

What is a SequenceFile in Hadoop?

A. A SequenceFile contains a binary encoding of an arbitrary number of homogeneous writable objects.
 B. A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous writable objects.
 C. A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.
 D. A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.

Answer: D

Is there a map input format in Hadoop?

A.  Yes, but only in Hadoop 0.22+.
 B.  Yes, there is a special format for map files.
 C.  No, but sequence file input format can read map files.
 D.  Both 2 and 3 are correct answers.
 Answers: C

What happens if mapper output does not match reducer input in Hadoop?

A.  Hadoop API will convert the data to the type that is needed by the reducer.
 B.  Data input/output inconsistency cannot occur. A preliminary validation check is executed prior to the full execution of the job to ensure there is consistency.
 C.  The java compiler will report an error during compilation but the job will complete with exceptions.
 D.  A real-time exception will be thrown and map-reduce job will fail.

Answer: D

Can you provide multiple input paths to a map-reduce jobs Hadoop?

A.  Yes, but only in Hadoop 0.22+.
 B.  No, Hadoop always operates on one input directory.
 C.  Yes, developers can add any number of input paths.
 D.  Yes, but the limit is currently capped at 10 input paths.

Answer:  C
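A sketch of a driver adding several input paths to one job (newer API; the paths are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiInputDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "multiple input paths");
    job.setJarByClass(MultiInputDriver.class);
    // Any number of input paths can be added; all of them feed the same mapper.
    FileInputFormat.addInputPath(job, new Path("/data/logs/2014-03-25"));
    FileInputFormat.addInputPath(job, new Path("/data/logs/2014-03-26"));
    // A comma-separated list works as well: FileInputFormat.addInputPaths(job, "/data/a,/data/b");
    FileOutputFormat.setOutputPath(job, new Path("/data/out"));
    // ... set the mapper/reducer classes, then:
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}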

Can a custom type for data Map-Reduce processing be implemented in Hadoop?

A.  No, Hadoop does not provide techniques for custom datatypes.
 B.  Yes, but only for mappers.
 C.  Yes, custom data types can be implemented as long as they implement writable interface.
 D.  Yes, but only for reducers.

Answer: C


The Hadoop API uses basic Java types such as LongWritable, Text, IntWritable. They have almost the same features as default java classes. What are these writable data types optimized for?

A.  Writable data types are specifically optimized for network transmissions
 B.  Writable data types are specifically optimized for file system storage
 C.  Writable data types are specifically optimized for map-reduce processing
 D.  Writable data types are specifically optimized for data retrieval

Answer: A

What is writable in  Hadoop?

A.  Writable is a java interface that needs to be implemented for streaming data to remote servers.
 B.  Writable is a java interface that needs to be implemented for HDFS writes.
 C.  Writable is a java interface that needs to be implemented for MapReduce processing.
 D.  None of these answers are correct.

Answer: C

What is the best performance one can expect from a Hadoop cluster?

A.  The best performance expectation one can have is measured in seconds. This is because Hadoop can only be used for batch processing
 B.  The best performance expectation one can have is measured in milliseconds. This is because Hadoop executes in parallel across so many machines
 C.  The best performance expectation one can have is measured in minutes. This is because Hadoop can only be used for batch processing
 D.  It depends on the design of the map-reduce program, how many machines are in the cluster, and the amount of data being retrieved

Answer: D

What is distributed cache in Hadoop?

A.  The distributed cache is special component on namenode that will cache frequently used data for faster client response. It is used during reduce step.
 B.  The distributed cache is special component on datanode that will cache frequently used data for faster client response. It is used during map step.
 C.  The distributed cache is a component that caches java objects.
 D.  The distributed cache is a component that allows developers to deploy jars for Map-Reduce processing.

Answer: D

Can you run Map - Reduce jobs directly on Avro data in Hadoop?

A.  Yes, Avro was specifically designed for data processing via Map-Reduce
 B.  Yes, but additional extensive coding is required
 C.  No, Avro was specifically designed for data storage only
 D.  Avro specifies metadata that allows easier data access. This data cannot be used as part of map-reduce execution, rather input specification only.

Answer: A

What is AVRO in Hadoop?

A.  Avro is a java serialization library
 B.  Avro is a java compression library
 C.  Avro is a java library that create splittable files
 D.  None of these answers are correct

Answer: A

Will settings using Java API overwrite values in configuration files in Hadoop?

A.  No. The configuration settings in the configuration file takes precedence
 B.  Yes. The configuration settings using Java API take precedence
 C.  It depends when the developer reads the configuration file. If it is read first then no.
 D.  Only global configuration settings are captured in configuration files on namenode. There are only a very few job parameters that can be set using Java API.

Answer: B

Which is faster: Map-side join or Reduce-side join? Why?

A.  Both techniques have about the same performance expectations.
 B.  Reduce-side join because the join operation is done on HDFS.
 C.  Map-side join is faster because the join operation is done in memory.
 D.  Reduce-side join because it is executed on the namenode, which will have a faster CPU and more memory.

Answer: C

What are the common problems with map-side join in Hadoop?

A.  The most common problem with map-side joins is introducing a high level of code complexity. This complexity has several downsides: increased risk of bugs and performance degradation. Developers are cautioned to rarely use map-side joins.
 B.  The most common problem with map-side joins is lack of the available map slots since map-side joins require a lot of mappers.
 C.  The most common problems with map-side joins are out of memory exceptions on slave nodes.
 D.  The most common problem with map-side join is not clearly specifying primary index in the join. This can lead to very slow performance on large datasets.

Answer: C

How can you overwrite the default input format in Hadoop?

A.  In order to overwrite default input format, the Hadoop administrator has to change default settings in config file.
 B.  In order to overwrite default input format, a developer has to set new input format on job config before submitting the job to a cluster.
 C.  The default input format is controlled by each individual mapper and each line needs to be parsed individually.
 D.  None of these answers are correct.

Answer: B

What is the default input format in Hadoop?

A.  The default input format is xml. Developer can specify other input formats as appropriate if xml is not the correct input.
 B.  There is no default input format. The input format always should be specified.
 C.  The default input format is a sequence file format. The data needs to be preprocessed before using the default input format.
 D.  The default input format is TextInputFormat with byte offset as a key and entire line as a value.

Answer: D

Why would a developer create a map-reduce without the reduce step Hadoop?

A.  Developers should design Map-Reduce jobs without reducers only if no reduce slots are available on the cluster.
 B.  Developers should never design Map-Reduce jobs without reducers. An error will occur upon compile.
 C.  There is a CPU intensive step that occurs between the map and reduce steps. Disabling the reduce step speeds up data processing.
 D.  It is not possible to create a map-reduce job without at least one reduce step. A developer may decide to limit to one reducer for debugging purposes.

Answer: C

How can you disable the reduce step in Hadoop?

A.  The Hadoop administrator has to set the number of the reducer slot to zero on all slave nodes. This will disable the reduce step.
 B.  It is impossible to disable the reduce step since it is a critical part of the Map-Reduce abstraction.
 C.  A developer can always set the number of the reducers to zero. That will completely disable the reduce step.
 D.  While you cannot completely disable reducers you can set output to one. There needs to be at least one reduce step in Map-Reduce abstraction.

Answer: C

What is PIG? in Hadoop

A.  Pig is a subset fo the Hadoop API for data processing
 B.  Pig is a part of the Apache Hadoop project that provides a C-like scripting language interface for data processing
 C.  Pig is a part of the Apache Hadoop project. It is a "PL-SQL" interface for data processing in Hadoop cluster
 D.  PIG is the third most popular form of meat in the US behind poultry and beef.

Answer: B

What is reduce - side join in Hadoop?

A.  Reduce-side join is a technique to eliminate data from initial data set at reduce step
 B.  Reduce-side join is a technique for merging data from different sources based on a specific key. There are no memory restrictions
 C.  Reduce-side join is a set of API to merge data from different sources.
 D.  None of these answers are correct

Answer: B

What is map - side join in Hadoop?

A.  Map-side join is done in the map phase and done in memory
 B.  Map-side join is a technique in which data is eliminated at the map step
 C.  Map-side join is a form of map-reduce API which joins data from different locations
 D.  None of these answers are correct

Answer: A

How can you use binary data in MapReduce in Hadoop?

A.  Binary data can be used directly by a map-reduce job. Often binary data is added to a sequence file.
 B.  Binary data cannot be used by the Hadoop framework. Binary data should be converted to a Hadoop compatible format prior to loading.
 C.  Binary data can be used in map-reduce only with very limited functionality. It cannot be used as a key, for example.
 D.  Hadoop can freely use binary files with map-reduce jobs so long as the files have headers

Answer: A

What are map files and why are they important in Hadoop?

A.  Map files are stored on the namenode and capture the metadata for all blocks on a particular rack. This is how Hadoop is "rack aware"
 B.  Map files are the files that show how the data is distributed in the Hadoop cluster.
 C.  Map files are generated by Map-Reduce after the reduce step. They show the task distribution during job execution
 D.  Map files are sorted sequence files that also have an index. The index allows fast data look up.

Answer: D

What are sequence files and why are they important in Hadoop?

A.  Sequence files are binary format files that are compressed and are splitable. They are often used in high-performance map-reduce jobs
 B.  Sequence files are a type of the file in the Hadoop framework that allow data to be sorted
 C.  Sequence files are intermediate files that are created by Hadoop after the map step
 D.  Both B and C are correct

Answer: A

How many methods does the Writable interface define in Hadoop?

A. Two
 B. Four
 C. Three
 D. None of the above

Answer: A
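The two methods are write(DataOutput) and readFields(DataInput). A sketch of a hypothetical custom Writable implementing them:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// The Writable contract is exactly two methods: serialize the fields to a DataOutput
// and read them back from a DataInput.
public class PageView implements Writable {
  private long timestamp;
  private int statusCode;

  public void write(DataOutput out) throws IOException {
    out.writeLong(timestamp);
    out.writeInt(statusCode);
  }

  public void readFields(DataInput in) throws IOException {
    timestamp = in.readLong();
    statusCode = in.readInt();
  }
}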

Which method of the FileSystem object is used for reading a file in HDFS in Hadoop?

A. open()
 B. access()
 C. select()
 D. None of the above

Answer: A
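A minimal sketch of reading a file through FileSystem.open() (the path is hypothetical):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // open() returns an FSDataInputStream positioned at the start of the file.
    FSDataInputStream in = fs.open(new Path("/data/weblogs/part-00000"));
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
  }
}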

RPC means ______ in Hadoop?

A. Remote processing call
 B. Remote process call
 C. Remote procedure call
 D. None of the above

Answer: C

Which switch is given to the “hadoop fs” command for detailed help?

A. -show
 B. -help
 C. -?
 D. None of the above

Answer: B

What is the default block size in HDFS in Hadoop?

A. 512 bytes
 B. 64 MB
 C. 1024 KB
 D. None of the above

Answer: B

Which MapReduce phase is theoretically able to utilize features of the underlying file system in order to optimize parallel execution in Hadoop?

A. Split
 B. Map
 C. Combine

Ans: A

What is the input to the Reduce function in Hadoop?

A. One key and a list of all values associated with that key.
 B. One key and a list of some values associated with that key.
 C. An arbitrarily sized list of key/value pairs.

Ans: A

How can a distributed filesystem such as HDFS provide opportunities for optimization of a MapReduce operation?

A. Data represented in a distributed filesystem is already sorted.
 B. Distributed filesystems must always be resident in memory, which is much faster than disk.
 C. Data storage and processing can be co-located on the same node, so that most input data relevant to Map or Reduce will be present on local disks or cache.
 D. A distributed filesystem makes random access faster because of the presence of a dedicated node serving file metadata.

Ans: C

Which of the following MapReduce execution frameworks focus on execution in shared-memory environments? 

A. Hadoop
 B. Twister
 C. Phoenix

Ans: C

What is the implementation language of the Hadoop MapReduce framework? 

A. Java
 B. C
 C. FORTRAN
 D. Python

Ans: A

The Combine stage, if present, must perform the same aggregation operation as Reduce ?
A. True
 B. False

Ans: B

Which MapReduce stage serves as a barrier, where all previous stages must be completed before it may proceed? 

A. Combine
 B. Group (a.k.a. 'shuffle')
 C. Reduce
 D. Write

Ans: B

Which TACC resource has support for Hadoop MapReduce?

 A. Ranger
 B. Longhorn
 C. Lonestar
 D. Spur

Ans: A

Which of the following scenarios makes HDFS unavailable in Hadoop?

A. JobTracker failure
 B. TaskTracker failure
 C. DataNode failure
 D. NameNode failure
 E. Secondary NameNode failure

Answer: D


You are running a Hadoop cluster with all monitoring facilities properly configured. Which scenario will go undetected in Hadoop?

A. Map or reduce tasks that are stuck in an infinite loop.
 B. HDFS is almost full.
 C. The NameNode goes down.
 D. A DataNode is disconnected from the cluster.
 E. MapReduce jobs that are causing excessive memory swaps.

Answer: A

Which of the following utilities allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer?

A. Oozie
 B. Sqoop
 C. Flume
 D. Hadoop Streaming

Answer: D

You need a distributed, scalable data store that allows you random, real-time read/write access to hundreds of terabytes of data. Which of the following would you use in Hadoop?

A. Hue
 B. Pig
 C. Hive
 D. Oozie
 E. HBase
 F. Flume
 G. Sqoop

Answer: E

Which of the following can workflows expressed in Oozie contain in Hadoop?

A. Iterative repetition of MapReduce jobs until a desired answer or state is reached.
 B. Sequences of MapReduce and Pig jobs. These are limited to linear sequences of actions with exception handlers but no forks.
 C. Sequences of MapReduce jobs only; no Pig or Hive tasks or jobs. These MapReduce sequences can be combined with forks and path joins.
 D. Sequences of MapReduce and Pig. These sequences can be combined with other actions including forks, decision points, and path joins.

Answer: D

You have an employee who is a Data Analyst and is very comfortable with SQL. He would like to run ad-hoc analysis on data in your HDFS cluster. Which of the following is a data warehousing software built on top of Apache Hadoop that defines a simple SQL-like query language well-suited for this kind of user?

A. Pig
 B. Hue
 C. Hive
 D. Sqoop
 E. Oozie
 F. Flume
 G. Hadoop Streaming

Answer: C

Which of the following statements most accurately describes the relationship between MapReduce and Pig?

A. Pig provides additional capabilities that allow certain types of data manipulation not possible with MapReduce.
 B. Pig provides no additional capabilities to MapReduce. Pig programs are executed as MapReduce jobs via the Pig interpreter.
 C. Pig programs rely on MapReduce but are extensible, allowing developers to do special-purpose processing not provided by MapReduce.
 D. Pig provides the additional capability of allowing you to control the flow of multiple MapReduce jobs.

Answer: D

In a MapReduce job, you want each of your input files processed by a single map task. How do you configure a MapReduce job so that a single map task processes each input file regardless of how many blocks the input file occupies?

A. Increase the parameter that controls minimum split size in the job configuration.
 B. Write a custom MapRunner that iterates over all key-value pairs in the entire file.
 C. Set the number of mappers equal to the number of input files you want to process.
 D. Write a custom FileInputFormat and override the method isSplittable to always return false.

Answer: D
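One common way to do this is to make the input format report every file as unsplittable; a sketch extending TextInputFormat (the class name is chosen for illustration, and the actual method is spelled isSplitable):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Reporting every file as unsplittable forces exactly one map task per input file,
// no matter how many HDFS blocks the file spans.
public class WholeFileTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }
}

// Driver: job.setInputFormatClass(WholeFileTextInputFormat.class);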

Which of the following best describes the workings of TextInputFormat in Hadoop?

A. Input file splits may cross line breaks. A line that crosses file splits is ignored.
 B. The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.
 C. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of both splits containing the broken line.
 D. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.
 E. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.

Answer: E




What happens in a MapReduce job when you set the number of reducers to one?

A. A single reducer gathers and processes all the output from all the mappers. The output is written in as many separate files as there are mappers.
B. A single reducer gathers and processes all the output from all the mappers. The output is written to a single file in HDFS.
C. Setting the number of reducers to one creates a processing bottleneck, and since the number of reducers as specified by the programmer is used as a reference value only, the MapReduce runtime provides a default setting for the number of reducers.
D. Setting the number of reducers to one is invalid, and an exception is thrown.

Answer:B

In the standard word count MapReduce algorithm, why might using a combiner reduce the overall Job running time?

A. Because combiners perform local aggregation of word counts, thereby allowing the mappers to process input data faster.
B. Because combiners perform local aggregation of word counts, thereby reducing the number of mappers that need to run.
C. Because combiners perform local aggregation of word counts, and then transfer that data to reducers without writing the intermediate data to disk.
D. Because combiners perform local aggregation of word counts, thereby reducing the number of key-value pairs that need to be shuffled across the network to the reducers.

Answer:D


You have user profile records in your OLTP database that you want to join with weblogs you have already ingested into HDFS. How will you obtain these user records?

A. HDFS commands
B. Pig load
C. Sqoop import
D. Hive

Answer :C