MapReduce with Hadoop Streaming in bash – Part 3


In our first MapReduce with Hadoop Streaming in bash article, we took a collection of Stephen Crane poems and used a MapReduce job to calculate ‘term frequency’, meaning we counted the number of times each word appeared. In the second part, we calculated ‘document frequency’ by counting the number of documents each word appears in, using the results from the first job.

For this final part, we will use the term frequency and document frequency to build the final Term Frequency/Inverse Document Frequency (TF-IDF) score. To do this, we need to plug our results into the TF-IDF formula.


The formula is TF-IDF = TF * ln(N / DF): the term frequency times the natural logarithm of the total number of documents (N) divided by the document frequency. So for each term/file combination, we need to calculate the TF-IDF based on the values provided by our last MapReduce job. Some people prefer to use a base-10 logarithm to dampen the results; I’ll cover this in the Mapper section.
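
To make the arithmetic concrete, here is a minimal sketch that evaluates the formula with awk for two cases. The function name tfidf and its argument order are my own choices for illustration, not something from the earlier articles.

```shell
#!/bin/bash
# tfidf TF DF N: prints TF * ln(N / DF) using awk's natural-log function.
# (The function name and argument order are illustrative.)
tfidf() {
  awk -v tf="$1" -v df="$2" -v n="$3" 'BEGIN { print tf * log(n / df) }'
}

tfidf 3 1 8   # term appears 3 times in a document, and in 1 of 8 documents
tfidf 1 8 8   # term appears in every document, so log(8/8) = 0
```

The first call prints 6.23832 and the second prints 0, which shows how a word that is in every document gets no weight no matter how often it occurs.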


So the first thing I’m going to do is get the output from the last job. You should be used to this by now.

[training@localhost steve]$ hadoop fs -get crane_out2/part-00000

The next thing I’m going to do is cheat. See, the algorithm requires the total number of documents that we’ve been analyzing. Sure, I could write a MapReduce job that looks through our latest output and emits a list of unique files; however, this is very inefficient and a waste of resources. Counting the entries in a simple ‘ls’ listing is much more efficient and makes better use of our (pseudo)cluster. The ‘tail -n +2’ skips the “Found N items” header line that ‘hadoop fs -ls’ prints before the entries. To figure out our total document count, we’ll do just that:

[training@localhost steve]$ hadoop fs -ls crane | tail -n +2 | wc -l

For our testing we’ll just explicitly set this as a variable. In the Hadoop job we’ll pass it in as a parameter.
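
If the folder might ever contain subdirectories, a slightly more defensive count is possible. The sketch below is an assumption-laden alternative, not what the article uses: it counts only listing lines whose permission string starts with "-" (regular files). The sample listing is made up so the snippet runs without a cluster; on a real system you would pipe ‘hadoop fs -ls crane’ into the function instead.

```shell
#!/bin/bash
# count_files: reads `hadoop fs -ls`-style text on stdin and counts the
# entries whose permission string starts with "-" (regular files).
# The function name is hypothetical.
count_files() {
  grep -c '^-'
}

# Made-up sample listing standing in for `hadoop fs -ls crane`
# (sizes, dates, and owner are invented for the demo).
sample_listing='Found 2 items
-rw-r--r--   1 training supergroup        123 2013-09-29 22:55 crane/met_a_seer.txt
-rw-r--r--   1 training supergroup        456 2013-09-29 22:55 crane/pursuing_the_horizon.txt'

N=$(printf '%s\n' "$sample_listing" | count_files)
echo "N=$N"
```

Because the header line does not start with "-", it is skipped without needing ‘tail’.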

On to the Mapper

So let’s go ahead and do our final calculation using the Mapper.

[training@localhost steve]$ cat maptfidf.sh 

#!/bin/bash
# Input fields: term, file, term frequency (tf), document frequency (df)
while read term file tf df; do
  TFIDF=$(echo $N $df $tf | awk '{print $3 * log($1/$2)}')
  printf "%s\t%s\t%s\n" "$term" "$file" "$TFIDF"
done

Simpler than you thought? Let’s look at what we did.

  1. Read each line into the variables term, file, tf, and df. These represent (from our last job) a unique term/document combination, the number of times the term appeared in that document (tf), and the number of documents the term appears in (df).
  2. Calculate TFIDF using awk. We pass it the total document count ($N, the variable we set during setup), the document frequency, and the term frequency; the result is TF times log(total/DF).
  3. Print the final output as Term, File, and TF-IDF. Term and File make up the unique key for each line of output.
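
The mapper above launches one awk process per input line, which is fine for 326 records. As a hedged alternative (not the script used in this article), the same three steps can be done by a single awk process that reads every record itself. The function name map_tfidf and the sample record are made up for illustration.

```shell
#!/bin/bash
# Single-process variant of the mapper: awk reads every record itself
# instead of being launched once per input line.
# map_tfidf is a hypothetical name; N is the total document count.
map_tfidf() {
  awk -v n="${N:?N must be set}" '{
    # $1 = term, $2 = file, $3 = tf, $4 = df
    printf "%s\t%s\t%s\n", $1, $2, $3 * log(n / $4)
  }'
}

N=8
# One made-up record: tf=3, df=1
printf 'where\tsome_file\t3\t1\n' | map_tfidf
```

Like ‘read’, awk splits on whitespace by default, so it accepts the same input; the score is formatted with awk's default six significant digits, the same as ‘print’ in the original.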

This is actually our final result: exactly what we’ve been trying to calculate, and the product of our three jobs. As I mentioned in the intro to this article, some people prefer to use log10() instead of the natural logarithm (base e) to dampen the results, which gives TF-IDF = TF * log10(N / DF).


If that’s the case, you can replace the awk line in the Mapper with this one:

TFIDF=$(echo $N $df $tf | awk '{print $3 * (log($1/$2)/log(10))}')
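
The division by log(10) is there because awk only provides a natural-log function, so base 10 has to be derived through the change-of-base identity log10(x) = ln(x) / ln(10). Here is a small side-by-side sketch; the two function names are illustrative only.

```shell
#!/bin/bash
# awk has log() (natural log) but no log10(), so base 10 is derived with
# the change-of-base identity log10(x) = ln(x) / ln(10).
tfidf_ln()    { awk -v tf="$1" -v df="$2" -v n="$3" 'BEGIN { print tf * log(n / df) }'; }
tfidf_log10() { awk -v tf="$1" -v df="$2" -v n="$3" 'BEGIN { print tf * (log(n / df) / log(10)) }'; }

# Same inputs, two bases: tf=3, df=1, N=8
echo "ln:    $(tfidf_ln 3 1 8)"
echo "log10: $(tfidf_log10 3 1 8)"
```

With these inputs the natural-log score is 6.23832 and the base-10 score is 2.70927. The ranking of terms is unchanged; only the scale shrinks.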

Let’s test it out in the shell, first setting the ‘N’ variable required for the algorithm:

[training@localhost steve]$ export N=`hadoop fs -ls crane | tail -n +2 | wc -l`
[training@localhost steve]$ cat part-00000 | ./maptfidf.sh | head -6
a	hdfs://	0
a	hdfs://	0
a	hdfs://	0
a	hdfs://	0
a	hdfs://	0
a	hdfs://	0

Not too descriptive with all those 0s, but if you know how TF-IDF works you know it is working. The lower the value, the less relevant that word is. The letter ‘a’ appears in all 8 documents, so log(8/8) = 0 and its score is 0 no matter how often it occurs, making it a very irrelevant word. Usually words like ‘a’ or ‘and’ or ‘the’ would have been filtered out in the beginning via a stoplist.
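
We didn't use a stoplist in this series, but for the curious, here is a rough sketch of what such a filter could look like in the same bash/awk style. The function name, the stoplist contents, and the sample records are all made up for the demo.

```shell
#!/bin/bash
# drop_stopwords STOPLIST: reads tab-separated records on stdin and drops
# every record whose first field (the term) is listed in STOPLIST,
# a file with one stop word per line. Names here are illustrative.
drop_stopwords() {
  awk 'NR == FNR { stop[$1]; next } !($1 in stop)' "$1" -
}

stoplist=$(mktemp)
printf 'a\nand\nthe\n' > "$stoplist"

# Two made-up records: only the "desert" line should survive.
printf 'a\tfile1\t0\ndesert\tfile1\t6.23832\n' | drop_stopwords "$stoplist"
rm -f "$stoplist"
```

awk reads the stoplist file first (while NR equals FNR) to build a lookup table, then passes through only the stdin records whose term is not in it.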

What Reducer?

Since the Mapper produced our final output, we don’t actually need to write a reducer. We could run with no reducer at all (-reducer NONE in the options), but instead we’ll use something called the IdentityReducer. An IdentityReducer takes its input and writes it back out unchanged, with no calculation. This accomplishes two things: 1) the data is shuffled and sorted on its way to the reducer, so with a single reducer it comes out sorted, and 2) we get a single output file instead of one per mapper, which is easier to work with later.
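
If you want a feel for what that looks like without a cluster, the shuffle-plus-identity step can be roughly imitated in the shell. This is only a local approximation under my own assumptions: ‘sort’ on the first two tab-separated fields stands in for Hadoop's shuffle, and ‘cat’ stands in for the IdentityReducer. The function names and sample records are made up.

```shell
#!/bin/bash
# shuffle: stands in for Hadoop's sort on the first two tab-separated fields.
shuffle() { LC_ALL=C sort -t "$(printf '\t')" -k1,2; }
# identity_reducer: passes every record through unchanged, like `cat`.
identity_reducer() { cat; }

# Made-up mapper output (term, file, score), deliberately out of order.
printf 'you\tfile_b\t2.77259\nthe\tfile_a\t0\nyou\tfile_a\t1.38629\n' |
  shuffle | identity_reducer
```

The records come out ordered by term and then file, and nothing about the scores changes, which is all the reducer stage does in this job.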

So you get a break this time. No reducer. Sweet!

Our Hadoop Command

[training@localhost steve]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D stream.num.map.output.key.fields=2 -D N=`hadoop fs \
-ls crane | tail -n +2 | wc -l` -input crane_out2 \
-output tfidf -mapper /home/training/steve/maptfidf.sh \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer

Just as in the previous articles, the backslashes are just there to show this is a multiline command. If you put it all on one line you don’t need them.

So a few things to note here. First, we set the stream.num.map.output.key.fields variable to 2. Even though we don’t have a formal reducer, we still want to tell the job the key field count so it will sort properly. Second, we set a new variable (-D is required for each one) called ‘N’ to the result of an ‘ls’ command in Hadoop against our original document folder. This variable will be exposed as an environment variable inside our shell script, where it denotes the total document count. The third thing is the -reducer setting. To use the identity reducer, set it to org.apache.hadoop.mapred.lib.IdentityReducer.
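
Since the whole calculation depends on ‘N’ arriving intact, a mapper could check it before doing any arithmetic. The following is a sketch of such a guard, not part of the script used in this article; the function name check_doc_count is hypothetical.

```shell
#!/bin/bash
# check_doc_count VALUE: succeeds only when VALUE is a positive integer.
# A mapper could call this on "$N" before doing any arithmetic, so a
# missing or mangled -D N=... fails fast instead of producing bad scores.
check_doc_count() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
  esac
  [ "$1" -gt 0 ]
}

if check_doc_count "${N:-}"; then
  echo "document count: $N"
else
  echo "N is missing or not a positive integer" >&2
fi
```

In a real mapper the else branch would probably exit with a non-zero status so the task is marked as failed.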

Running this command prints the job log below and saves the final output to the ‘tfidf’ folder in HDFS:

[training@localhost steve]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -D stream.num.map.output.key.fields=2 -D N=8 -input crane_out2 -output tfidf -mapper /home/training/steve/maptfidf.sh -reducer org.apache.hadoop.mapred.lib.IdentityReducer
packageJobJar: [/tmp/hadoop-training/hadoop-unjar6684831878608134041/] [] /tmp/streamjob5486308040698764550.jar tmpDir=null
13/10/01 07:33:26 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/10/01 07:33:26 WARN snappy.LoadSnappy: Snappy native library is available
13/10/01 07:33:26 INFO snappy.LoadSnappy: Snappy native library loaded
13/10/01 07:33:26 INFO mapred.FileInputFormat: Total input paths to process : 1
13/10/01 07:33:26 INFO mapred.JobClient: Running job: job_201309292255_0066
13/10/01 07:33:27 INFO mapred.JobClient:  map 0% reduce 0%
13/10/01 07:33:32 INFO mapred.JobClient:  map 100% reduce 0%
13/10/01 07:33:35 INFO mapred.JobClient:  map 100% reduce 100%
13/10/01 07:33:36 INFO mapred.JobClient: Job complete: job_201309292255_0066
13/10/01 07:33:36 INFO mapred.JobClient: Counters: 33
13/10/01 07:33:36 INFO mapred.JobClient:   File System Counters
13/10/01 07:33:36 INFO mapred.JobClient:     FILE: Number of bytes read=24244
13/10/01 07:33:36 INFO mapred.JobClient:     FILE: Number of bytes written=421188
13/10/01 07:33:36 INFO mapred.JobClient:     FILE: Number of read operations=0
13/10/01 07:33:36 INFO mapred.JobClient:     FILE: Number of large read operations=0
13/10/01 07:33:36 INFO mapred.JobClient:     FILE: Number of write operations=0
13/10/01 07:33:36 INFO mapred.JobClient:     HDFS: Number of bytes read=22438
13/10/01 07:33:36 INFO mapred.JobClient:     HDFS: Number of bytes written=23586
13/10/01 07:33:36 INFO mapred.JobClient:     HDFS: Number of read operations=3
13/10/01 07:33:36 INFO mapred.JobClient:     HDFS: Number of large read operations=0
13/10/01 07:33:36 INFO mapred.JobClient:     HDFS: Number of write operations=2
13/10/01 07:33:36 INFO mapred.JobClient:   Job Counters 
13/10/01 07:33:36 INFO mapred.JobClient:     Launched map tasks=1
13/10/01 07:33:36 INFO mapred.JobClient:     Launched reduce tasks=1
13/10/01 07:33:36 INFO mapred.JobClient:     Data-local map tasks=1
13/10/01 07:33:36 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=5617
13/10/01 07:33:36 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=3000
13/10/01 07:33:36 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/10/01 07:33:36 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/10/01 07:33:36 INFO mapred.JobClient:   Map-Reduce Framework
13/10/01 07:33:36 INFO mapred.JobClient:     Map input records=326
13/10/01 07:33:36 INFO mapred.JobClient:     Map output records=326
13/10/01 07:33:36 INFO mapred.JobClient:     Map output bytes=23586
13/10/01 07:33:36 INFO mapred.JobClient:     Input split bytes=108
13/10/01 07:33:36 INFO mapred.JobClient:     Combine input records=0
13/10/01 07:33:36 INFO mapred.JobClient:     Combine output records=0
13/10/01 07:33:36 INFO mapred.JobClient:     Reduce input groups=326
13/10/01 07:33:36 INFO mapred.JobClient:     Reduce shuffle bytes=24244
13/10/01 07:33:36 INFO mapred.JobClient:     Reduce input records=326
13/10/01 07:33:36 INFO mapred.JobClient:     Reduce output records=326
13/10/01 07:33:36 INFO mapred.JobClient:     Spilled Records=652
13/10/01 07:33:36 INFO mapred.JobClient:     CPU time spent (ms)=840
13/10/01 07:33:36 INFO mapred.JobClient:     Physical memory (bytes) snapshot=199655424
13/10/01 07:33:36 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=776904704
13/10/01 07:33:36 INFO mapred.JobClient:     Total committed heap usage (bytes)=176492544
13/10/01 07:33:36 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
13/10/01 07:33:36 INFO mapred.JobClient:     BYTES_READ=22330
13/10/01 07:33:36 INFO streaming.StreamJob: Output directory: tfidf

The Final Results

After three MapReduce jobs, we’re finally ready to see our word/document and associated TF-IDF. Score! (ba dum tss)

[training@localhost steve]$ hadoop fs -cat tfidf/part-00000
a	hdfs://	0
a	hdfs://	0
a	hdfs://	0
a	hdfs://	0
a	hdfs://	0
a	hdfs://	0
a	hdfs://	0
a	hdfs://	0
accosted	hdfs://	2.07944
achieved	hdfs://	2.07944
addressed	hdfs://	2.07944
again	hdfs://	2.07944
ages	hdfs://	2.07944
agony	hdfs://	2.07944
ah	hdfs://	1.38629
ah	hdfs://	1.38629
already	hdfs://	2.07944
am	hdfs://	2.07944
and	hdfs://	0.267063
and	hdfs://	0.400594
and	hdfs://	0.400594
and	hdfs://	0.133531
and	hdfs://	0.267063
and	hdfs://	0.267063
and	hdfs://	0.133531
another's	hdfs://	2.07944
are	hdfs://	2.07944
as	hdfs://	2.07944
at	hdfs://	2.07944
aye	hdfs://	1.38629
aye	hdfs://	1.38629
ball	hdfs://	8.31777
bawled	hdfs://	2.07944
been	hdfs://	2.07944
before	hdfs://	2.07944
began	hdfs://	2.07944
believed	hdfs://	2.07944
black	hdfs://	1.38629
black	hdfs://	1.38629
blind	hdfs://	2.07944
book	hdfs://	4.15888
boys	hdfs://	2.07944
breath	hdfs://	4.15888
but	hdfs://	1.38629
but	hdfs://	1.38629
by	hdfs://	1.38629
by	hdfs://	5.54518
called	hdfs://	2.07944
calling	hdfs://	4.15888
can	hdfs://	2.07944
cavern	hdfs://	2.07944
child	hdfs://	4.15888
chronicle	hdfs://	2.07944
clay	hdfs://	2.07944
climbed	hdfs://	2.07944
collection	hdfs://	4.15888
concentrating	hdfs://	2.07944
court	hdfs://	2.07944
created	hdfs://	2.07944
crevice	hdfs://	2.07944
cried	hdfs://	1.38629
cried	hdfs://	2.77259
crowd	hdfs://	2.07944
crowned	hdfs://	2.07944
cuddle	hdfs://	2.07944
curious	hdfs://	2.07944
dead	hdfs://	2.07944
death	hdfs://	2.07944
deathslime	hdfs://	2.07944
denial	hdfs://	2.07944
desert	hdfs://	6.23832
dire	hdfs://	2.07944
disturbed	hdfs://	2.07944
earth	hdfs://	2.07944
echoes	hdfs://	2.07944
error	hdfs://	2.07944
eternal	hdfs://	2.07944
even	hdfs://	2.07944
eventually	hdfs://	1.38629
eventually	hdfs://	1.38629
ever	hdfs://	4.15888
every	hdfs://	2.07944
exist	hdfs://	2.07944
fact	hdfs://	2.07944
families	hdfs://	2.07944
feckless	hdfs://	2.07944
fenceless	hdfs://	2.07944
fireside	hdfs://	2.07944
fleetly	hdfs://	2.07944
for	hdfs://	0.980829
for	hdfs://	0.980829
for	hdfs://	0.980829
fortress	hdfs://	2.07944
freedom	hdfs://	2.07944
from	hdfs://	0.693147
from	hdfs://	1.38629
from	hdfs://	0.693147
from	hdfs://	0.693147
futile	hdfs://	2.07944
game	hdfs://	2.07944
garment	hdfs://	4.15888
god	hdfs://	13.8629
god	hdfs://	1.38629
gold	hdfs://	8.31777
grown	hdfs://	2.07944
had	hdfs://	2.07944
halfinjustices	hdfs://	2.07944
hand	hdfs://	2.07944
hands	hdfs://	2.07944
has	hdfs://	2.07944
have	hdfs://	1.38629
have	hdfs://	4.15888
he	hdfs://	1.38629
he	hdfs://	4.15888
he	hdfs://	2.77259
he	hdfs://	0.693147
heat	hdfs://	2.07944
heavens	hdfs://	2.07944
held	hdfs://	4.15888
hem	hdfs://	4.15888
highest	hdfs://	2.07944
him	hdfs://	2.77259
him	hdfs://	1.38629
his	hdfs://	1.38629
his	hdfs://	1.38629
hold	hdfs://	2.07944
honest	hdfs://	2.07944
horizon	hdfs://	1.38629
horizon	hdfs://	1.38629
however	hdfs://	2.07944
i	hdfs://	0.470004
i	hdfs://	2.82002
i	hdfs://	1.88001
i	hdfs://	2.35002
i	hdfs://	1.41001
in	hdfs://	0.287682
in	hdfs://	0.287682
in	hdfs://	0.287682
in	hdfs://	0.287682
in	hdfs://	0.287682
in	hdfs://	0.287682
into	hdfs://	2.07944
is	hdfs://	0.575364
is	hdfs://	2.01377
is	hdfs://	0.287682
is	hdfs://	0.287682
is	hdfs://	0.575364
is	hdfs://	0.575364
it	hdfs://	1.43841
it	hdfs://	0.287682
it	hdfs://	0.287682
it	hdfs://	0.287682
it	hdfs://	0.575364
it	hdfs://	0.575364
its	hdfs://	2.77259
its	hdfs://	4.15888
joys	hdfs://	2.07944
kindly	hdfs://	2.07944
know	hdfs://	2.07944
let	hdfs://	2.07944
lie	hdfs://	2.07944
life's	hdfs://	2.07944
lived	hdfs://	2.07944
lo	hdfs://	2.07944
lone	hdfs://	2.07944
long	hdfs://	2.07944
looked	hdfs://	2.07944
looks	hdfs://	2.07944
loud	hdfs://	2.07944
mad	hdfs://	2.07944
man	hdfs://	0.980829
man	hdfs://	1.96166
man	hdfs://	1.96166
market	hdfs://	2.07944
me	hdfs://	0.693147
me	hdfs://	1.38629
me	hdfs://	0.693147
me	hdfs://	0.693147
melons	hdfs://	2.07944
men	hdfs://	4.15888
merciful	hdfs://	2.07944
met	hdfs://	2.07944
mighty	hdfs://	2.07944
mile	hdfs://	4.15888
million	hdfs://	2.07944
mocked	hdfs://	2.07944
much	hdfs://	4.15888
never	hdfs://	1.38629
never	hdfs://	2.77259
newspaper	hdfs://	10.3972
night	hdfs://	2.07944
no	hdfs://	1.38629
no	hdfs://	2.77259
not	hdfs://	1.38629
not	hdfs://	1.38629
now	hdfs://	4.15888
obligation	hdfs://	2.07944
of	hdfs://	0.287682
of	hdfs://	1.15073
of	hdfs://	1.43841
of	hdfs://	0.863046
of	hdfs://	0.575364
of	hdfs://	0.575364
often	hdfs://	2.07944
on	hdfs://	2.07944
one	hdfs://	2.07944
opened	hdfs://	2.07944
opinion	hdfs://	2.07944
part	hdfs://	4.15888
phantom	hdfs://	4.15888
place	hdfs://	2.07944
plains	hdfs://	2.07944
player	hdfs://	2.07944
pursued	hdfs://	2.07944
pursuing	hdfs://	2.07944
ran	hdfs://	2.07944
read	hdfs://	2.07944
remote	hdfs://	2.07944
replied	hdfs://	2.07944
roaming	hdfs://	2.07944
rock	hdfs://	2.07944
round	hdfs://	4.15888
said	hdfs://	0.470004
said	hdfs://	0.470004
said	hdfs://	0.470004
said	hdfs://	0.940007
said	hdfs://	0.940007
sand	hdfs://	2.07944
saw	hdfs://	1.38629
saw	hdfs://	1.38629
scores	hdfs://	2.07944
screamed	hdfs://	2.07944
second	hdfs://	2.07944
seer	hdfs://	2.07944
sells	hdfs://	2.07944
sense	hdfs://	2.07944
shadow	hdfs://	4.15888
should	hdfs://	2.07944
sir	hdfs://	1.38629
sir	hdfs://	2.77259
skill	hdfs://	2.07944
sky	hdfs://	1.38629
sky	hdfs://	1.38629
smiled	hdfs://	2.07944
smote	hdfs://	2.07944
sneering	hdfs://	2.07944
so	hdfs://	2.07944
space	hdfs://	2.07944
spaces	hdfs://	2.07944
sped	hdfs://	2.77259
sped	hdfs://	1.38629
spirit	hdfs://	2.07944
spreads	hdfs://	2.07944
spurred	hdfs://	2.07944
squalor	hdfs://	2.07944
strange	hdfs://	2.77259
strange	hdfs://	1.38629
stupidities	hdfs://	2.07944
suddenly	hdfs://	2.07944
swift	hdfs://	2.07944
sword	hdfs://	2.07944
symbol	hdfs://	2.07944
take	hdfs://	2.07944
tale	hdfs://	2.07944
tales	hdfs://	2.07944
that	hdfs://	1.38629
that	hdfs://	4.15888
the	hdfs://	0
the	hdfs://	0
the	hdfs://	0
the	hdfs://	0
the	hdfs://	0
the	hdfs://	0
the	hdfs://	0
the	hdfs://	0
their	hdfs://	2.07944
then	hdfs://	1.38629
then	hdfs://	1.38629
there	hdfs://	1.38629
there	hdfs://	1.38629
they	hdfs://	2.07944
think	hdfs://	2.07944
this	hdfs://	1.96166
this	hdfs://	0.980829
this	hdfs://	0.980829
through	hdfs://	1.38629
through	hdfs://	2.77259
to	hdfs://	0.693147
to	hdfs://	0.693147
to	hdfs://	1.38629
to	hdfs://	2.07944
touched	hdfs://	4.15888
tower	hdfs://	2.07944
traveller	hdfs://	6.23832
tried	hdfs://	2.07944
truth	hdfs://	6.23832
unfairly	hdfs://	2.07944
unhaltered	hdfs://	2.07944
universe	hdfs://	4.15888
vacant	hdfs://	2.07944
valleys	hdfs://	2.07944
victory	hdfs://	2.07944
voice	hdfs://	4.15888
walked	hdfs://	2.07944
was	hdfs://	2.77259
was	hdfs://	0.693147
was	hdfs://	0.693147
was	hdfs://	0.693147
well	hdfs://	2.07944
went	hdfs://	1.38629
went	hdfs://	2.77259
when	hdfs://	1.38629
when	hdfs://	1.38629
whence	hdfs://	2.07944
where	hdfs://	6.23832
which	hdfs://	1.38629
which	hdfs://	1.38629
while	hdfs://	4.15888
wind	hdfs://	4.15888
wins	hdfs://	2.07944
wisdom	hdfs://	1.38629
wisdom	hdfs://	1.38629
world	hdfs://	1.38629
world	hdfs://	1.38629
you	hdfs://	1.38629
you	hdfs://	2.77259

Just as in our test, “a” is not important at all as it appears in 8 documents so it came out with a score of 0. “Where” seems very important for such a common word. But it turns out that is because it shows up 3 times in only 1 file. Remember, TF-IDF is “a numerical statistic which reflects how important a word is to a document in a collection or corpus”. Take a look at words like “you” at the end of the file–it shows up in two different files, but has a different TF-IDF weight for each one. That’s because “you” only appears once in the first file but twice in the second file, making it more important to that document in relation to the whole corpus. You can see this with ‘grep’ commands against the original content.

[training@localhost steve]$ grep -i you crane/met_a_seer.txt 
Of that which you hold. 
[training@localhost steve]$ grep -i you crane/pursuing_the_horizon.txt 
"You can never -- " 
"You lie," he cried,

And with that, we just built an index. If we were to build a search engine against those 8 Stephen Crane poems, then a search for a word would output the matching files ordered by TF-IDF descending. That way the most pertinent (keyword rich) files would come first in the results.
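
Here is a minimal sketch of that lookup, assuming the index is a local copy of the job output with term, file, and score separated by tabs. The function name search is my own, and the file names in the sample index are shortened stand-ins for the full paths in the real output.

```shell
#!/bin/bash
# search TERM INDEX_FILE: prints the files containing TERM, best score first.
# INDEX_FILE holds term<TAB>file<TAB>tfidf lines like the job output.
search() {
  awk -F '\t' -v term="$1" '$1 == term { print $3 "\t" $2 }' "$2" |
    sort -t "$(printf '\t')" -k1,1nr
}

# Tiny sample index; file names are shortened stand-ins for the full paths.
index=$(mktemp)
printf 'you\tmet_a_seer.txt\t1.38629\nyou\tpursuing_the_horizon.txt\t2.77259\n' > "$index"
search you "$index"
rm -f "$index"
```

For the word “you” this lists pursuing_the_horizon.txt (2.77259) ahead of met_a_seer.txt (1.38629), which is the ordering a search result page would want.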


Of course, we could have done this project much more easily with tools like Lucene and Mahout. They are made for this sort of thing, and have a ton of extra features including automatic stoplisting, weight tuning, etc. But it wouldn’t be nearly as fun, right?

This concludes our TF-IDF with Hadoop Streaming in bash exercise. If you have any feedback on better ways to do these tasks (or errata) please let me know in the comments!

The post MapReduce with Hadoop Streaming in bash – Part 3 appeared first on Oracle Alchemist.

