Thursday, September 15, 2016

Counters in Hadoop MapReduce

In this post I would like to explain the meaning of the Hadoop counters (the ones you generally see after job completion). I have been analyzing the starvation of long-running jobs on our relatively small cluster, and the Hadoop counters were of extreme importance in this investigation. Unfortunately, I could not find any resource that explains them in detail. In the list below I try to describe clearly what each counter means in the Hadoop 2.6 release; a short sketch showing how to read these counters from Java follows the table.
Each entry below lists the counter name, its display name in parentheses, and a detailed explanation.
File System Counters
FILE_BYTES_READ (FILE: Number of bytes read): Amount of data read from the local filesystem.
FILE_BYTES_WRITTEN (FILE: Number of bytes written): Amount of data written to the local filesystem.
FILE_READ_OPS (FILE: Number of read operations): Number of read operations on the local filesystem.
FILE_LARGE_READ_OPS (FILE: Number of large read operations): Number of large read operations on the local filesystem (the ones that do not fit entirely into memory).
FILE_WRITE_OPS (FILE: Number of write operations): Number of write operations on the local filesystem.
HDFS_BYTES_READ (HDFS: Number of bytes read): Amount of data read from HDFS.
HDFS_BYTES_WRITTEN (HDFS: Number of bytes written): Amount of data written to HDFS.
HDFS_READ_OPS (HDFS: Number of read operations): Number of read operations on HDFS.
HDFS_LARGE_READ_OPS (HDFS: Number of large read operations): Number of large read operations on HDFS (the ones that do not fit entirely into memory).
HDFS_WRITE_OPS (HDFS: Number of write operations): Number of write operations on HDFS.
Job Counters
TOTAL_LAUNCHED_MAPS (Launched map tasks): Total number of launched map tasks.
TOTAL_LAUNCHED_REDUCES (Launched reduce tasks): Total number of launched reduce tasks.
DATA_LOCAL_MAPS (Data-local map tasks): Number of map tasks launched on nodes that already hold the required data.
SLOTS_MILLIS_MAPS (Total time spent by all maps in occupied slots (ms)): Total wall-clock time during which map tasks occupied slots.
SLOTS_MILLIS_REDUCES (Total time spent by all reduces in occupied slots (ms)): Total wall-clock time during which reduce tasks occupied slots.
MILLIS_MAPS (Total time spent by all map tasks (ms)): Wall-clock time during which resources were occupied by mappers.
MILLIS_REDUCES (Total time spent by all reduce tasks (ms)): Wall-clock time during which resources were occupied by reducers.
VCORES_MILLIS_MAPS (Total vcore-seconds taken by all map tasks): Number of vcores allocated to the mappers multiplied by the number of seconds the mappers have been running, aggregated over all map tasks.
VCORES_MILLIS_REDUCES (Total vcore-seconds taken by all reduce tasks): Number of vcores allocated to the reducers multiplied by the number of seconds the reducers have been running, aggregated over all reduce tasks.
MB_MILLIS_MAPS (Total megabyte-seconds taken by all map tasks): Amount of memory (in megabytes) allocated to the mappers multiplied by the number of seconds the mappers have been running, aggregated over all map tasks.
MB_MILLIS_REDUCES (Total megabyte-seconds taken by all reduce tasks): Amount of memory (in megabytes) allocated to the reducers multiplied by the number of seconds the reducers have been running, aggregated over all reduce tasks.
Map-Reduce Framework
MAP_INPUT_RECORDS (Map input records): Total number of records processed by mappers.
MAP_OUTPUT_RECORDS (Map output records): Total number of records produced by mappers.
MAP_OUTPUT_BYTES (Map output bytes): Total amount of data produced by mappers.
MAP_OUTPUT_MATERIALIZED_BYTES (Map output materialized bytes): Amount of map output actually written to disk (after compression, if compression is enabled).
SPLIT_RAW_BYTES (Input split bytes): Amount of data consumed by the metadata representation of the input splits.
COMBINE_INPUT_RECORDS (Combine input records): Total number of records processed by combiners (if implemented).
COMBINE_OUTPUT_RECORDS (Combine output records): Total number of records produced by combiners (if implemented).
REDUCE_INPUT_GROUPS (Reduce input groups): Total number of unique keys.
REDUCE_SHUFFLE_BYTES (Reduce shuffle bytes): Amount of map output data copied to the reducers during the shuffle phase.
REDUCE_INPUT_RECORDS (Reduce input records): Total number of records processed by reducers.
REDUCE_OUTPUT_RECORDS (Reduce output records): Total number of records produced by reducers.
SPILLED_RECORDS (Spilled Records): Total number of records (from both mappers and reducers) spilled to disk when there is not enough memory.
SHUFFLED_MAPS (Shuffled Maps): Total number of map outputs transferred to reducers during the shuffle phase.
FAILED_SHUFFLE (Failed Shuffles): Total number of map output transfers that failed during the shuffle phase.
MERGED_MAP_OUTPUTS (Merged Map outputs): Total number of map output files merged on the reduce side after the shuffle phase.
GC_TIME_MILLIS (GC time elapsed (ms)): Wall-clock time spent in garbage collection.
CPU_MILLISECONDS (CPU time spent (ms)): Cumulative CPU time across all tasks.
PHYSICAL_MEMORY_BYTES (Physical memory (bytes) snapshot): Total physical memory used by all tasks, including spilled data.
VIRTUAL_MEMORY_BYTES (Virtual memory (bytes) snapshot): Total virtual memory used by all tasks.
COMMITTED_HEAP_BYTES (Total committed heap usage (bytes)): Total amount of memory available to the JVMs.
Shuffle Errors
BAD_ID: Total number of errors related to interpreting the IDs in shuffle headers (the mapper ID, for example).
CONNECTION: The source code does not reveal any usage of this counter.
IO_ERROR: Total number of errors related to reading and writing intermediate data.
WRONG_LENGTH: Total number of errors related to misbehaving compression and decompression of intermediate data.
WRONG_MAP: Total number of errors related to duplication of mapper output data (when the framework tries to process an already processed mapper output).
WRONG_REDUCE: Total number of errors related to attempts to shuffle data to the wrong reducer (when the shuffle for a given reducer receives data intended for a different reducer).
File Input Format Counters
BYTES_READ (Bytes Read): Amount of data read by all tasks across all filesystems.
File Output Format Counters
BYTES_WRITTEN (Bytes Written): Amount of data written by all tasks across all filesystems.
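
To make the table above concrete, here is a minimal sketch of reading some of these counters from Java after a job finishes. TaskCounter and JobCounter are the enums behind the Map-Reduce Framework and Job Counters groups; the job setup itself is only a hypothetical placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobCounter;
import org.apache.hadoop.mapreduce.TaskCounter;

public class CounterReport {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "counter-report"); // hypothetical job setup
        // ... set mapper, reducer, input and output paths here ...
        job.waitForCompletion(true);

        // The same values printed after job completion are available here.
        Counters counters = job.getCounters();
        long spilled = counters.findCounter(TaskCounter.SPILLED_RECORDS).getValue();
        long launchedMaps = counters.findCounter(JobCounter.TOTAL_LAUNCHED_MAPS).getValue();
        long shuffleBytes = counters.findCounter(TaskCounter.REDUCE_SHUFFLE_BYTES).getValue();

        System.out.println("Spilled records:      " + spilled);
        System.out.println("Launched map tasks:   " + launchedMaps);
        System.out.println("Reduce shuffle bytes: " + shuffleBytes);
    }
}

For example, a SPILLED_RECORDS value far above MAP_OUTPUT_RECORDS usually points at an undersized sort buffer.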

Tuesday, September 13, 2016

Write files to Hadoop that have spaces in the file name

Assume we have a file called RCA_10th Sep_Double Demand.docx on our local system and we want to move it to HDFS.

Normally we would write the command like this:

hadoop fs -put /home/bandaru.nagesh_gmail/RCA_10th Sep_Double Demand.docx /user/bandaru.nagesh_gmail/RCA_10th Sep_Double Demand.docx

You will get the error below, because the shell splits the unquoted path on the spaces and passes each piece as a separate argument:

put: `Demand.docx': No such file or directory


So, let's try something different and put quote marks around the paths:

hadoop fs -put "/home/bandaru.nagesh_gmail/RCA_10th Sep_Double Demand.docx" "/user/bandaru.nagesh_gmail/RCA_10th Sep_Double Demand.docx"

Oops! This time you will get the error below, because hadoop fs parses each path as a URI, and a space is not a valid URI character:

put: unexpected URISyntaxException

Solution:

Here is the solution for the issue: we need to replace each space with \%20, so that Hadoop recognizes it as a space and allows the command to run.
So let's see how this works now. Run the command below:

hadoop fs -put /<Local Path>/RCA_10th\%20Sep_Double\%20Demand.docx /<HDFS Path>/RCA_10th_Sep_Double_Demand.docx
Wow! It works!

Result: you can find the file moved to HDFS.

Note: the destination file name should be provided without any spaces, but in the source file name each space can be replaced with \%20.

Now let us try something different and put \%20 in the HDFS path as well, to see what happens:

hadoop fs -put /home/bandaru.nagesh_gmail/RCA_10th\%20Sep_Double\%20Demand.docx /user/bandaru.nagesh_gmail/RCA_10th_Sep\%20Double\%20Demand.docx

The results of the two statements above show that the file was copied with a literal \%20 in its name. That is why I explained that the spaces should be removed from the HDFS file name.
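
As a side note, the same copy can be done from Java, where no escaping should be needed: the Path(String) constructor percent-encodes illegal URI characters such as spaces by itself. A minimal sketch (the paths are just the examples from this post):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutFileWithSpaces {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Spaces are fine here: Path encodes them internally,
        // so the \%20 trick is not needed.
        Path src = new Path("/home/bandaru.nagesh_gmail/RCA_10th Sep_Double Demand.docx");
        Path dst = new Path("/user/bandaru.nagesh_gmail/RCA_10th_Sep_Double_Demand.docx");

        fs.copyFromLocalFile(src, dst);
    }
}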



Monday, September 12, 2016

Difference between Hadoop OLD API and NEW API

Below are the differences between the Hadoop old API (org.apache.hadoop.mapred) and the new API (org.apache.hadoop.mapreduce, introduced in release 0.20 and used in 1.x and 2.x).

Mapper & Reducer
New API: Mapper and Reducer are abstract classes, so a method (with a default implementation) can be added without breaking old implementations of the class.
Old API: Mapper and Reducer are interfaces (these still exist in the new API as well).

Package
New API: lives in the org.apache.hadoop.mapreduce package.
Old API: can still be found in org.apache.hadoop.mapred.

User code communicating with the MapReduce system
New API: uses a "context" object to communicate with the MapReduce system.
Old API: uses the JobConf, OutputCollector, and Reporter objects to communicate with the MapReduce system.

Controlling mapper and reducer execution
New API: allows both mappers and reducers to control the execution flow by overriding the run() method.
Old API: mappers can be controlled by writing a MapRunnable, but no equivalent exists for reducers.

Job control
New API: job control is done through the Job class.
Old API: job control was done through JobClient (which does not exist in the new API).

Job configuration
New API: job configuration is done through the Configuration class, via some of the helper methods on Job.
Old API: a JobConf object, which is an extension of the Configuration class, was used for job configuration:

java.lang.Object
  extended by org.apache.hadoop.conf.Configuration
    extended by org.apache.hadoop.mapred.JobConf

Output file names
New API: map outputs are named part-m-nnnnn and reduce outputs are named part-r-nnnnn (where nnnnn is an integer designating the part number, starting from zero).
Old API: both map and reduce outputs are named part-nnnnn.

Values passed to the reduce() method
New API: the reduce() method receives values as a java.lang.Iterable.
Old API: the reduce() method receives values as a java.util.Iterator.
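
To make the contrast concrete, below is the same trivial mapper written against both APIs (a sketch; the class names are just examples):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// New API: Mapper is an abstract class, and everything
// (output, status, counters) goes through the Context object.
class NewApiTokenMapper
        extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE); // Context replaces OutputCollector and Reporter
        }
    }
}

// Old API: Mapper is an interface; output goes through an
// OutputCollector and progress reporting through a Reporter.
class OldApiTokenMapper extends org.apache.hadoop.mapred.MapReduceBase
        implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value,
                    org.apache.hadoop.mapred.OutputCollector<Text, IntWritable> output,
                    org.apache.hadoop.mapred.Reporter reporter) throws IOException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            output.collect(word, ONE);
        }
    }
}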