Thursday, September 15, 2016

Counters in Hadoop MapReduce

In this post I would like to explain the meaning of the Hadoop counters (the ones you generally see after job completion). I have been analyzing the starvation of long-running jobs on our relatively small cluster, and the Hadoop counters were of extreme importance in this investigation. Unfortunately, I could not find any resource that explains them in detail. In the list below I try to describe clearly what each counter means in the Hadoop 2.6 release; a short sketch showing how to read these counters from Java follows the table.
Each entry below lists the counter name, its display name in parentheses, and a detailed explanation.
File System Counters
FILE_BYTES_READ (FILE: Number of bytes read): Amount of data read from the local filesystem.
FILE_BYTES_WRITTEN (FILE: Number of bytes written): Amount of data written to the local filesystem.
FILE_READ_OPS (FILE: Number of read operations): Number of read operations on the local filesystem.
FILE_LARGE_READ_OPS (FILE: Number of large read operations): Number of large read operations on the local filesystem (the ones that do not fit entirely into memory).
FILE_WRITE_OPS (FILE: Number of write operations): Number of write operations on the local filesystem.
HDFS_BYTES_READ (HDFS: Number of bytes read): Amount of data read from HDFS.
HDFS_BYTES_WRITTEN (HDFS: Number of bytes written): Amount of data written to HDFS.
HDFS_READ_OPS (HDFS: Number of read operations): Number of read operations on HDFS.
HDFS_LARGE_READ_OPS (HDFS: Number of large read operations): Number of large read operations on HDFS (the ones that do not fit entirely into memory).
HDFS_WRITE_OPS (HDFS: Number of write operations): Number of write operations on HDFS.
Job Counters
TOTAL_LAUNCHED_MAPS (Launched map tasks): Total number of launched map tasks.
TOTAL_LAUNCHED_REDUCES (Launched reduce tasks): Total number of launched reduce tasks.
DATA_LOCAL_MAPS (Data-local map tasks): Number of map tasks launched on nodes that already hold the required data.
SLOTS_MILLIS_MAPS (Total time spent by all maps in occupied slots (ms)): Total wall-clock time during which map tasks occupied slots.
SLOTS_MILLIS_REDUCES (Total time spent by all reduces in occupied slots (ms)): Total wall-clock time during which reduce tasks occupied slots.
MILLIS_MAPS (Total time spent by all map tasks (ms)): Wall-clock time during which resources were occupied by mappers.
MILLIS_REDUCES (Total time spent by all reduce tasks (ms)): Wall-clock time during which resources were occupied by reducers.
VCORES_MILLIS_MAPS (Total vcore-seconds taken by all map tasks): Number of vcores allocated to the mappers multiplied by the number of seconds the mappers have been running, aggregated over all map tasks.
VCORES_MILLIS_REDUCES (Total vcore-seconds taken by all reduce tasks): Number of vcores allocated to the reducers multiplied by the number of seconds the reducers have been running, aggregated over all reduce tasks.
MB_MILLIS_MAPS (Total megabyte-seconds taken by all map tasks): Amount of memory (in megabytes) allocated to the mappers multiplied by the number of seconds the mappers have been running, aggregated over all map tasks.
MB_MILLIS_REDUCES (Total megabyte-seconds taken by all reduce tasks): Amount of memory (in megabytes) allocated to the reducers multiplied by the number of seconds the reducers have been running, aggregated over all reduce tasks.
Map-Reduce Framework
MAP_INPUT_RECORDS (Map input records): Total number of records processed by mappers.
MAP_OUTPUT_RECORDS (Map output records): Total number of records produced by mappers.
MAP_OUTPUT_BYTES (Map output bytes): Total amount of data produced by mappers.
MAP_OUTPUT_MATERIALIZED_BYTES (Map output materialized bytes): Amount of map output actually written to disk (after compression, if compression is enabled).
SPLIT_RAW_BYTES (Input split bytes): Amount of data consumed by the metadata representation of the input splits.
COMBINE_INPUT_RECORDS (Combine input records): Total number of records processed by combiners (if implemented).
COMBINE_OUTPUT_RECORDS (Combine output records): Total number of records produced by combiners (if implemented).
REDUCE_INPUT_GROUPS (Reduce input groups): Total number of unique keys.
REDUCE_SHUFFLE_BYTES (Reduce shuffle bytes): Amount of map output data copied to the reducers during the shuffle phase.
REDUCE_INPUT_RECORDS (Reduce input records): Total number of records processed by reducers.
REDUCE_OUTPUT_RECORDS (Reduce output records): Total number of records produced by reducers.
SPILLED_RECORDS (Spilled Records): Total number of records (from both mappers and reducers) spilled to disk when there is not enough memory.
SHUFFLED_MAPS (Shuffled Maps): Total number of map outputs transferred to reducers during the shuffle phase.
FAILED_SHUFFLE (Failed Shuffles): Total number of map output transfers that failed during the shuffle phase.
MERGED_MAP_OUTPUTS (Merged Map outputs): Total number of map output files merged on the reduce side after the shuffle phase.
GC_TIME_MILLIS (GC time elapsed (ms)): Wall-clock time spent in garbage collection.
CPU_MILLISECONDS (CPU time spent (ms)): Cumulative CPU time across all tasks.
PHYSICAL_MEMORY_BYTES (Physical memory (bytes) snapshot): Total physical memory used by all tasks, including spilled data.
VIRTUAL_MEMORY_BYTES (Virtual memory (bytes) snapshot): Total virtual memory used by all tasks.
COMMITTED_HEAP_BYTES (Total committed heap usage (bytes)): Total amount of memory available to the JVMs.
Shuffle Errors
BAD_ID: Total number of errors related to interpreting the IDs in shuffle headers (the mapper ID, for example).
CONNECTION: The source code does not reveal any usage of this counter.
IO_ERROR: Total number of errors related to reading and writing intermediate data.
WRONG_LENGTH: Total number of errors related to misbehaving compression and decompression of intermediate data.
WRONG_MAP: Total number of errors related to duplication of mapper output data (when the framework tries to process an already processed mapper output).
WRONG_REDUCE: Total number of errors related to attempts to shuffle data to the wrong reducer (when the shuffle for a given reducer receives data intended for a different reducer).
File Input Format Counters
BYTES_READ (Bytes Read): Amount of data read by all tasks across all filesystems.
File Output Format Counters
BYTES_WRITTEN (Bytes Written): Amount of data written by all tasks across all filesystems.
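
To make the table above concrete, here is a minimal sketch of reading some of these counters from Java after a job finishes. TaskCounter and JobCounter are the enums behind the Map-Reduce Framework and Job Counters groups; the job setup itself is only a hypothetical placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobCounter;
import org.apache.hadoop.mapreduce.TaskCounter;

public class CounterReport {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "counter-report"); // hypothetical job setup
        // ... set mapper, reducer, input and output paths here ...
        job.waitForCompletion(true);

        // The same values printed after job completion are available here.
        Counters counters = job.getCounters();
        long spilled = counters.findCounter(TaskCounter.SPILLED_RECORDS).getValue();
        long launchedMaps = counters.findCounter(JobCounter.TOTAL_LAUNCHED_MAPS).getValue();
        long shuffleBytes = counters.findCounter(TaskCounter.REDUCE_SHUFFLE_BYTES).getValue();

        System.out.println("Spilled records:      " + spilled);
        System.out.println("Launched map tasks:   " + launchedMaps);
        System.out.println("Reduce shuffle bytes: " + shuffleBytes);
    }
}

For example, a SPILLED_RECORDS value far above MAP_OUTPUT_RECORDS usually points at an undersized sort buffer.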

Tuesday, September 13, 2016

Write files to Hadoop that have spaces in the file name

Assume we have a file called RCA_10th Sep_Double Demand.docx on our local system and we want to move it to HDFS.

Normally we would write the command like this:

hadoop fs -put /home/bandaru.nagesh_gmail/RCA_10th Sep_Double Demand.docx /user/bandaru.nagesh_gmail/RCA_10th Sep_Double Demand.docx

You will get the error below, because the shell splits the unquoted path on the spaces and passes each piece as a separate argument:

put: `Demand.docx': No such file or directory


So, let's try something different and put quote marks around the paths:

hadoop fs -put "/home/bandaru.nagesh_gmail/RCA_10th Sep_Double Demand.docx" "/user/bandaru.nagesh_gmail/RCA_10th Sep_Double Demand.docx"

Oops! This time you will get the error below, because hadoop fs parses each path as a URI, and a space is not a valid URI character:

put: unexpected URISyntaxException

Solution:

Here is the solution for the issue: we need to replace each space with \%20, so that Hadoop recognizes it as a space and allows the command to run.
So let's see how this works now. Run the command below:

hadoop fs -put /<Local Path>/RCA_10th\%20Sep_Double\%20Demand.docx /<HDFS Path>/RCA_10th_Sep_Double_Demand.docx
Wow! It works!

Result: you can find the file moved to HDFS.

Note: the destination file name should be provided without any spaces, but in the source file name each space can be replaced with \%20.

Now let us try something different and put \%20 in the HDFS path as well, to see what happens:

hadoop fs -put /home/bandaru.nagesh_gmail/RCA_10th\%20Sep_Double\%20Demand.docx /user/bandaru.nagesh_gmail/RCA_10th_Sep\%20Double\%20Demand.docx

The results of the two statements above show that the file was copied with a literal \%20 in its name. That is why I explained that the spaces should be removed from the HDFS file name.
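
As a side note, the same copy can be done from Java, where no escaping should be needed: the Path(String) constructor percent-encodes illegal URI characters such as spaces by itself. A minimal sketch (the paths are just the examples from this post):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutFileWithSpaces {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Spaces are fine here: Path encodes them internally,
        // so the \%20 trick is not needed.
        Path src = new Path("/home/bandaru.nagesh_gmail/RCA_10th Sep_Double Demand.docx");
        Path dst = new Path("/user/bandaru.nagesh_gmail/RCA_10th_Sep_Double_Demand.docx");

        fs.copyFromLocalFile(src, dst);
    }
}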



Monday, September 12, 2016

Difference between Hadoop OLD API and NEW API

Below are the differences between the Hadoop old API (org.apache.hadoop.mapred) and the new API (org.apache.hadoop.mapreduce, introduced in release 0.20 and used in 1.x and 2.x).

Mapper & Reducer
New API: Mapper and Reducer are abstract classes, so a method (with a default implementation) can be added without breaking old implementations of the class.
Old API: Mapper and Reducer are interfaces (these still exist in the new API as well).

Package
New API: lives in the org.apache.hadoop.mapreduce package.
Old API: can still be found in org.apache.hadoop.mapred.

User code communicating with the MapReduce system
New API: uses a "context" object to communicate with the MapReduce system.
Old API: uses the JobConf, OutputCollector, and Reporter objects to communicate with the MapReduce system.

Controlling mapper and reducer execution
New API: allows both mappers and reducers to control the execution flow by overriding the run() method.
Old API: mappers can be controlled by writing a MapRunnable, but no equivalent exists for reducers.

Job control
New API: job control is done through the Job class.
Old API: job control was done through JobClient (which does not exist in the new API).

Job configuration
New API: job configuration is done through the Configuration class, via some of the helper methods on Job.
Old API: a JobConf object, which is an extension of the Configuration class, was used for job configuration:

java.lang.Object
  extended by org.apache.hadoop.conf.Configuration
    extended by org.apache.hadoop.mapred.JobConf

Output file names
New API: map outputs are named part-m-nnnnn and reduce outputs are named part-r-nnnnn (where nnnnn is an integer designating the part number, starting from zero).
Old API: both map and reduce outputs are named part-nnnnn.

Values passed to the reduce() method
New API: the reduce() method receives values as a java.lang.Iterable.
Old API: the reduce() method receives values as a java.util.Iterator.
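
To make the contrast concrete, below is the same trivial mapper written against both APIs (a sketch; the class names are just examples):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// New API: Mapper is an abstract class, and everything
// (output, status, counters) goes through the Context object.
class NewApiTokenMapper
        extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE); // Context replaces OutputCollector and Reporter
        }
    }
}

// Old API: Mapper is an interface; output goes through an
// OutputCollector and progress reporting through a Reporter.
class OldApiTokenMapper extends org.apache.hadoop.mapred.MapReduceBase
        implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value,
                    org.apache.hadoop.mapred.OutputCollector<Text, IntWritable> output,
                    org.apache.hadoop.mapred.Reporter reporter) throws IOException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            output.collect(word, ONE);
        }
    }
}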