<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Hadoop &#8211; Prologika</title>
	<atom:link href="https://prologika.com/tag/hadoop/feed/" rel="self" type="application/rss+xml" />
	<link>https://prologika.com</link>
	<description>Business Intelligence Consulting and Training in Atlanta</description>
	<lastBuildDate>Tue, 16 Feb 2021 08:52:32 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
		<title>Programming MapReduce Jobs with HDInsight Server for Windows</title>
		<link>https://prologika.com/programming-mapreduce-jobs-with-hdinsight-server-for-windows/</link>
					<comments>https://prologika.com/programming-mapreduce-jobs-with-hdinsight-server-for-windows/#comments</comments>
		
		<dc:creator><![CDATA[Prologika - Teo Lachev]]></dc:creator>
		<pubDate>Fri, 28 Dec 2012 20:18:00 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Hadoop]]></category>
		<guid isPermaLink="false">/CS/blogs/blog/archive/2012/12/28/programming-mapreduce-jobs-with-hdinsight-server-for-windows.aspx</guid>

					<description><![CDATA[In a previous blog &#8220;Installing HDInsight Server for Windows&#8221;, I introduced you to the Microsoft HDInsight Server for Windows. Recall that HDInsight Server for Windows is a Windows-based Hadoop distribution [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>In a previous <a href="/CS/blogs/blog/archive/2012/10/31/installing-hdinsight-server-for-windows.aspx">blog</a> &#8220;Installing HDInsight Server for Windows&#8221;, I introduced you to the Microsoft HDInsight Server for Windows. Recall that HDInsight Server for Windows is a Windows-based Hadoop distribution that offers two main benefits for Big Data customers:</p>
<ul>
<li>An officially supported Hadoop distribution on Windows Server – Previously, you could set up Hadoop on Windows only as an unsupported installation (via Cygwin) for development purposes. What this means for you is that you can now set up a Hadoop cluster on servers running the Windows Server OS.</li>
<li>Extends the reach of the Hadoop ecosystem to .NET developers by allowing them to write MapReduce jobs in .NET code, such as C#.</li>
</ul>
<p>And, in previous <a href="#">blogs</a>, I&#8217;ve introduced you to Hadoop. Recall that there are two main reasons for using Hadoop for storing and processing Big Data:</p>
<ul>
<li>Storage – You can store massive files in a distributed and fault-tolerant file system (HDFS) without worrying that hardware failure will result in a loss of data.</li>
<li>Distributed processing – When you outgrow the limitations of a single server, you can distribute job processing across the nodes in a Hadoop cluster. This allows you to perform crude data analysis directly on files stored in HDFS or execute any other type of job that can benefit from parallel execution.</li>
</ul>
<p>This blog continues the HDInsight Server for Windows journey. As many of you probably don&#8217;t have experience in Unix or Java, I&#8217;ll show you how HDInsight makes it easy to write MapReduce jobs on a Windows machine.</p>
<p style="background: #f2f2f2;"><strong>Note</strong> Writing MapReduce jobs can be complex. If all you need is performing some crude data analysis, you should consider an abstraction layer, such as <a href="/CS/blogs/blog/archive/2012/06/24/ms-guy-does-hadoop-part-3-hive.aspx">Hive</a>, which is capable for deriving the schema and generating the MapReduce jobs for you. This doesn&#8217;t mean that experience in MapReduce is not useful. When processing the files go beyond just imposing a schema on the data and querying the results , you might need programming logic, such as in <a href="http://open.blogs.nytimes.com/2008/05/21/the-new-york-times-archives-amazon-web-services-timesmachine/">The New York Times Archive</a> case.</p>
<p>As a prerequisite, I installed HDInsight on my Windows 8 laptop. Because of its prerelease status, the CTP of HDInsight Server for Windows currently supports a single node only, which is fine for development and testing. My task is to analyze the same dataset that I used in the MS BI Guy Does Hadoop (Part 2 – Taking Hadoop for a Spin) <a href="/CS/blogs/blog/archive/2012/06/09/ms-bi-guy-does-hadoop-part-2-taking-hadoop-for-a-spin.aspx">blog</a>. The dataset (temp.txt) contains temperature readings from weather stations around the world, and it represents the weather datasets kept by the <a href="http://www.ncdc.noaa.gov/oa/ncdc.html">National Climatic Data Center (NCDC)</a>. You will find the sample dataset in the <a href="#" target="_blank">source code</a> attached to this blog. It has the following content (the most important parts are highlighted in red: the year at offset 15 and the temperature at offset 88).</p>
<p><span style="font-size: 9pt;">006701199099999<span style="color: red;"><strong>1950</strong></span>051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+0<span style="color: red;">0<strong>00</strong></span>1+99999999999 </span></p>
<p><span style="font-size: 9pt;">004301199099999<span style="color: red;"><strong>1950</strong></span>051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+0<span style="color: red;">0<strong>22</strong></span>1+99999999999 </span></p>
<p><span style="font-size: 9pt;">004301199099999<span style="color: red;"><strong>1950</strong></span>051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-0<span style="color: red;">0<strong>11</strong></span>1+99999999999 </span></p>
<p><span style="font-size: 9pt;">004301265099999<span style="color: red;"><strong>1949</strong></span>032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+0<span style="color: red;">1<strong>11</strong></span>1+99999999999 </span></p>
<p><span style="font-size: 9pt;">004301265099999<span style="color: red;"><strong>1949</strong></span>032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+0<span style="color: red;">0<strong>78</strong></span>1+99999999999 </span></p>
<h1><span style="font-size: 11pt;">Note that the data is stored in its raw format and no schema was imposed on the data. The schema will be derived at runtime by parsing the file content. </span></h1>
<h1>Installing Microsoft .NET SDK for Hadoop</h1>
<p>The <a href="http://hadoopsdk.codeplex.com/">Microsoft .NET SDK for Hadoop</a> facilitates the programming effort required to code MapReduce jobs in .NET. To install it:</p>
<ol>
<li>Install <a href="http://docs.nuget.org/docs/start-here/installing-nuget">NuGet</a> first. NuGet is a Visual Studio extension that makes it easy to add, remove, and update libraries and tools in Visual Studio projects that use the .NET Framework.</li>
<li>Open Visual Studio (2010 or 2012) and create a new C# Class Library project.</li>
<li>Go to Tools &#8594; Library Package Manager &#8594; Package Manager Console.</li>
<li>
<div>In the Package Manager Console window that opens in the bottom of the screen, enter:<br />
<span style="color: #253340; font-family: Consolas; font-size: 9pt;">install-package Microsoft.Hadoop.MapReduce –pre</span></div>
<p>This command will download the required Hadoop binaries and add them as references in your project.</p>
</li>
</ol>
<h1>Coding the Map Job</h1>
<p>The Map job is responsible for parsing the input (the weather dataset), deriving the schema from it, and generating a key-value pair for the data that we&#8217;re interested in. In our case, the key will be the year and the value will be the temperature measure for that year. The Map class derives from the MapperBase class defined in Microsoft.Hadoop.MapReduce.dll.</p>
<p><img fetchpriority="high" decoding="async" class="alignnone wp-image-2135 size-full" src="/wp-content/uploads/2012/12/122812_2018_Programming1.png" alt="122812_2018_Programming1" width="505" height="393" srcset="https://prologika.com/wp-content/uploads/2012/12/122812_2018_Programming1.png 505w, https://prologika.com/wp-content/uploads/2012/12/122812_2018_Programming1-450x350.png 450w, https://prologika.com/wp-content/uploads/2012/12/122812_2018_Programming1-300x233.png 300w" sizes="(max-width: 505px) 100vw, 505px" /></p>
<p>At runtime, HDInsight will parse the file content and invoke the Map method once for each line in the file. In our case, the Map job is simple. We parse the input and extract the temperature and year. If the parsing operation is successful, we return the key-value pair. The end result will look like this:</p>
<p><span style="color: #253340; font-family: Consolas; font-size: 9pt;">(1950, 0) </span></p>
<p><span style="color: #253340; font-family: Consolas; font-size: 9pt;">(1950, 22) </span></p>
<p><span style="color: #253340; font-size: 9pt;"><span style="font-family: Consolas;">(1950, </span><span style="font-family: Times New Roman;">−</span><span style="font-family: Consolas;">11) </span></span></p>
<p><span style="color: #253340; font-family: Consolas; font-size: 9pt;">(1949, 111) </span></p>
<p><span style="color: #253340; font-family: Consolas; font-size: 9pt;">(1949, 78) </span></p>
<h1>Coding the Reduce Job</h1>
<p>Suppose that we want to get the maximum temperature for each year. Because each weather station might have multiple readings (lines in the input file) for the same year, we need to combine the results and find the maximum temperature per year. This is analogous to GROUP BY in SQL. The following Reduce job gets the work done:</p>
<p><img decoding="async" class="alignnone wp-image-2136 size-full" src="/wp-content/uploads/2012/12/122812_2018_Programming2.png" alt="122812_2018_Programming2" width="555" height="250" srcset="https://prologika.com/wp-content/uploads/2012/12/122812_2018_Programming2.png 555w, https://prologika.com/wp-content/uploads/2012/12/122812_2018_Programming2-450x203.png 450w, https://prologika.com/wp-content/uploads/2012/12/122812_2018_Programming2-300x135.png 300w" sizes="(max-width: 555px) 100vw, 555px" /></p>
<p>The Reduce job is even simpler. The Hadoop framework pre-processes the output of the Map jobs before it&#8217;s sent to the Reduce function. This processing sorts and groups the key-value pairs by key, so the input to the Reduce job will look like this:</p>
<p><span style="color: #253340; font-family: Consolas; font-size: 9pt;">(1949, [111, 78]) </span></p>
<p><span style="color: #253340; font-family: Consolas; font-size: 9pt;">(1950, [0, 22, −11]) </span></p>
<p>In our case, the only thing left for the Reduce job is to loop through the values for a given key (year) and return the maximum value, so the final output will be:</p>
<p><span style="color: #253340; font-family: Consolas; font-size: 9pt;">(1949, 111) </span></p>
<p><span style="color: #253340; font-family: Consolas; font-size: 9pt;">(1950, 22) </span></p>
<h1>Testing MapReduce</h1>
<p>Instead of deploying to Hadoop each time you make a change during the development and testing lifecycle, you can add another project, such as a Console Application, and use it as a test harness to test the MapReduce code. For your convenience, Microsoft provides a StreamingUnit class in Microsoft.Hadoop.MapReduce.dll. Here is what our test harness code looks like:</p>
<p><img decoding="async" class="alignnone wp-image-2138 size-full" src="/wp-content/uploads/2012/12/122812_2018_Programming3.png" alt="122812_2018_Programming3" width="499" height="377" srcset="https://prologika.com/wp-content/uploads/2012/12/122812_2018_Programming3.png 499w, https://prologika.com/wp-content/uploads/2012/12/122812_2018_Programming3-450x340.png 450w, https://prologika.com/wp-content/uploads/2012/12/122812_2018_Programming3-300x227.png 300w" sizes="(max-width: 499px) 100vw, 499px" /></p>
<p>The code uses a test input file. It reads the content of the file one line at a time and adds each line as a new element to an instance of ArrayList. Then, the code calls the StreamingUnit.Execute method to initiate the MapReduce job.</p>
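<p>A rough equivalent of the harness in the screenshot, using List&lt;string&gt; in place of ArrayList and assuming the mapper and reducer classes sketched above, might look like this:</p>
<pre>
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.Hadoop.MapReduce;

class Program
{
    static void Main()
    {
        // Read the test input file one line at a time (the path is hypothetical).
        var input = new List&lt;string&gt;(File.ReadAllLines(@"D:\MyApp\Hadoop\MapReduce\temp.txt"));

        // Run the MapReduce pipeline in-process, without deploying to Hadoop.
        var output = StreamingUnit.Execute&lt;TemperatureMapper, TemperatureReducer&gt;(input);

        // Result is assumed to hold the reducer output lines.
        foreach (string line in output.Result)
        {
            Console.WriteLine(line);
        }
    }
}
</pre>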
<h1>Deploying to Hadoop</h1>
<p>Once the code is tested, it&#8217;s time to deploy the dataset and MapReduce jobs to Hadoop.</p>
<ol>
<li>Deploy the file to the Hadoop HDFS file system.<br />
<span style="font-family: Courier New;">C:\Hadoop\hadoop-1.1.0-SNAPSHOT\bin&gt;hadoop fs -copyFromLocal D:\MyApp\Hadoop\MapReduce\temp.txt input/Temp/input.txt</span></li>
</ol>
<p style="background: #f2f2f2;"><strong>Note</strong> When you execute the hadoop command shell in the previous step, the file will be uploaded to your folder. However, if you use the JavaScript interactive console found in the HDInsight Dashboard, the file will be uploaded to the Hadoop folder in HDFS because the console runs under the hadoop user. Consequently, the MapReduce job won&#8217;t be able to find the file. So, you use the hadoop command prompt.</p>
<p>2. Browse the file system using the web interface (<a href="http://localhost:50070">http://localhost:50070</a>) to see that the file is in your folder.</p>
<p><img loading="lazy" decoding="async" class="alignnone wp-image-2140 size-full" src="/wp-content/uploads/2012/12/122812_2018_Programming4.png" alt="122812_2018_Programming4" width="505" height="157" srcset="https://prologika.com/wp-content/uploads/2012/12/122812_2018_Programming4.png 505w, https://prologika.com/wp-content/uploads/2012/12/122812_2018_Programming4-450x140.png 450w, https://prologika.com/wp-content/uploads/2012/12/122812_2018_Programming4-300x93.png 300w" sizes="auto, (max-width: 505px) 100vw, 505px" /></p>
<p>3. Finally, we need to execute the job with HadoopJobExecutor, which can be called in various ways. The easiest way is to use MRRunner:<br />
<span style="font-family: Courier New;">D:\MyApp\Hadoop\MapReduce\FirstJob\bin\Debug&gt;.\mrlib\mrrunner -dll FirstJob.dll</span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">D:\MyApp\Hadoop\MapReduce\FirstJob\bin\Debug&gt;.\mrlib\mrrunner -dll FirstJob.dll </span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">File dependencies to include with job:[Auto-detected] D:\MyApp\Hadoop\MapReduce\FirstJob\bin\Debug\FirstJob.dll </span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">[Auto-detected] D:\MyApp\Hadoop\MapReduce\FirstJob\bin\Debug\Microsoft.Hadoop.MapReduce.dll </span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">[Auto-detected] D:\MyApp\Hadoop\MapReduce\FirstJob\bin\Debug\Newtonsoft.Json.dll </span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">&gt;&gt;CMD: c:\hadoop\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd jar c:\hadoop\hadoop-1.1.0-SNAPSHOT\lib\hadoop-streaming.jar -D &#8220;mapred.map.max.attempts=1&#8221; -D &#8220;mapred.reduce.max.attempts=1&#8221; -input inpu </span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">emp -mapper ..\..\jars\Microsoft.Hadoop.MapDriver.exe -reducer ..\..\jars\Microsoft.Hadoop.ReduceDriver.exe -file D:\MyApp\Hadoop\MapReduce\FirstJob\bin\Debug\MRLib\Microsoft.Hadoop.MapDriver.e </span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">p\MapReduce\FirstJob\bin\Debug\MRLib\Microsoft.Hadoop.ReduceDriver.exe -file D:\MyApp\Hadoop\MapReduce\FirstJob\bin\Debug\MRLib\Microsoft.Hadoop.CombineDriver.exe -file &#8220;D:\MyApp\Hadoop\MapRedu </span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">irstJob.dll&#8221; -file &#8220;D:\MyApp\Hadoop\MapReduce\FirstJob\bin\Debug\Microsoft.Hadoop.MapReduce.dll&#8221; -file &#8220;D:\MyApp\Hadoop\MapReduce\FirstJob\bin\Debug\Newtonsoft.Json.dll&#8221; -cmdenv &#8220;MSFT_HADOOP_MA </span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">-cmdenv &#8220;MSFT_HADOOP_MAPPER_TYPE=FirstJob.TemperatureMapper&#8221; -cmdenv &#8220;MSFT_HADOOP_REDUCER_DLL=FirstJob.dll&#8221; -cmdenv &#8220;MSFT_HADOOP_REDUCER_TYPE=FirstJob.TemperatureReducer&#8221; </span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">packageJobJar: [D:\MyApp\Hadoop\MapReduce\FirstJob\bin\Debug\MRLib\Microsoft.Hadoop.MapDriver.exe, D:\MyApp\Hadoop\MapReduce\FirstJob\bin\Debug\MRLib\Microsoft.Hadoop.ReduceDriver.exe, D:\MyApp </span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">Job\bin\Debug\MRLib\Microsoft.Hadoop.CombineDriver.exe, D:\MyApp\Hadoop\MapReduce\FirstJob\bin\Debug\FirstJob.dll, D:\MyApp\Hadoop\MapReduce\FirstJob\bin\Debug\Microsoft.Hadoop.MapReduce.dll, D </span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">e\FirstJob\bin\Debug\Newtonsoft.Json.dll] [/C:/Hadoop/hadoop-1.1.0-SNAPSHOT/lib/hadoop-streaming.jar] C:\Users\Teo\AppData\Local\Temp\streamjob7017247708817804198.jar tmpDir=null </span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">12/12/28 12:35:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform&#8230; using builtin-java classes where applicable </span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">log4j:ERROR Failed to rename [C:\Hadoop\hadoop-1.1.0-SNAPSHOT\logs/hadoop.log] to [C:\Hadoop\hadoop-1.1.0-SNAPSHOT\logs/hadoop.log.2012-12-27]. </span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">12/12/28 12:35:20 WARN snappy.LoadSnappy: Snappy native library not loaded </span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">12/12/28 12:35:20 INFO mapred.FileInputFormat: Total input paths to process : 1 </span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">12/12/28 12:35:20 INFO streaming.StreamJob: getLocalDirs(): [c:\hadoop\hdfs\mapred\local] </span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">12/12/28 12:35:20 INFO streaming.StreamJob: Running job: job_201212271510_0010 </span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">12/12/28 12:35:20 INFO streaming.StreamJob: To kill this job, run: </span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">12/12/28 12:35:20 INFO streaming.StreamJob: C:\Hadoop\hadoop-1.1.0-SNAPSHOT/bin/hadoop job -Dmapred.job.tracker=localhost:50300 -kill job_201212271510_0010 </span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">12/12/28 12:35:20 INFO streaming.StreamJob: Tracking URL: http://127.0.0.1:50030/jobdetails.jsp?jobid=job_201212271510_0010 </span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">12/12/28 12:35:21 INFO streaming.StreamJob: map 0% reduce 0% </span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">12/12/28 12:35:38 INFO streaming.StreamJob: map 100% reduce 0% </span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">12/12/28 12:35:50 INFO streaming.StreamJob: map 100% reduce 100% </span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">12/12/28 12:35:56 INFO streaming.StreamJob: Job complete: job_201212271510_0010 </span></p>
<p><span style="font-family: Courier New; font-size: 8pt;">12/12/28 12:35:56 INFO streaming.StreamJob: Output: output/Temp </span></p>
<p>4. Using the web interface or the JavaScript console, go to the output folder and view the part-00000 file to see the output (it should match your testing results).</p>
<p><img loading="lazy" decoding="async" class="alignnone wp-image-2141 size-full" src="/wp-content/uploads/2012/12/122812_2018_Programming5.png" alt="122812_2018_Programming5" width="268" height="173" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://prologika.com/programming-mapreduce-jobs-with-hdinsight-server-for-windows/feed/</wfw:commentRss>
			<slash:comments>4</slash:comments>
		
		
			</item>
		<item>
		<title>Installing HDInsight Server for Windows</title>
		<link>https://prologika.com/installing-hdinsight-server-for-windows/</link>
					<comments>https://prologika.com/installing-hdinsight-server-for-windows/#respond</comments>
		
		<dc:creator><![CDATA[Prologika - Teo Lachev]]></dc:creator>
		<pubDate>Thu, 01 Nov 2012 02:10:00 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Hadoop]]></category>
		<guid isPermaLink="false">/CS/blogs/blog/archive/2012/10/31/installing-hdinsight-server-for-windows.aspx</guid>

					<description><![CDATA[As you&#8217;ve probably heard the news, Microsoft rebranded their Big Data offerings as HDInsight that currently encompasses two key services: Windows Azure HDInsight Service (formerly known as Hadoop-based Services on [&#8230;]]]></description>
					<content:encoded><![CDATA[<p>As you&#8217;ve probably heard the news, Microsoft rebranded their Big Data offerings as HDInsight, which currently encompasses two key services:</p>
<ul>
<li>Windows Azure HDInsight Service (formerly known as Hadoop-based Services on Windows Azure) – This is a cloud-based Hadoop distribution hosted on Windows Azure.</li>
<li>
<div><a href="http://www.microsoft.com/web/gallery/install.aspx?appid=HDINSIGHT-PREVIEW">Microsoft HDInsight Server</a> for Windows – A Windows-based Hadoop distribution that offers two main benefits for Big Data customers:</div>
<ul>
<li>An officially supported Hadoop distribution on Windows Server – Previously, you could set up Hadoop on Windows only as an unsupported installation (via Cygwin) for development purposes. What this means for you is that you can now set up a Hadoop cluster on servers running the Windows Server OS.</li>
<li>Extends the reach of the Hadoop ecosystem to .NET developers and allows them to write MapReduce jobs in .NET code, such as C#.</li>
</ul>
</li>
</ul>
<p>Both services are available as preview offerings, and changes are expected as they evolve. The Installing the Developer Preview of Apache Hadoop-based services on Windows <a href="http://social.technet.microsoft.com/wiki/contents/articles/14141.installing-the-developer-preview-of-apachetm-hadooptm-based-services-on-windows.aspx">article</a> covers the setup steps pretty well. I decided to set up HDInsight Server for Windows by using the <a href="http://www.microsoft.com/web/gallery/install.aspx?appid=HDINSIGHT-PREVIEW">Microsoft Web Platform Installer</a> on my Windows 8 laptop.</p>
<p style="background: #d0cece;"><strong>Note</strong> Initially, I planned to install HDInsight Server for Windows on a VM running Windows Server 2012 Standard Edition. Although the installer completed successfully, it failed to create the sites and shortcuts to the dashboards (Hadoop Name Node, Dashboard, and MapRaduce). This was probably caused by the fact that server was configured as a domain controller. There is an ongoing discussion about this issue on the Microsoft HDInsight <a href="http://social.msdn.microsoft.com/Forums/en-US/hdinsight/thread/a0a25c89-2d28-4f52-83e2-5161211f7d28">forum</a>.</p>
<p>The Windows 8 setup failed to create the shortcut to the dashboard. However, the following steps fixed the issue:</p>
<p>1. Open an Administrator PowerShell prompt and relax the PowerShell execution policy to allow scripts.</p>
<p><span style="font-family: Courier New; font-size: 10pt;">PS:&gt; Set-ExecutionPolicy RemoteSigned </span></p>
<p>2. Navigate to the C:\HadoopFeaturePackSetup\HadoopFeaturePackSetupTools folder:</p>
<p><span style="color: black; font-family: Courier New; font-size: 10pt;">cd C<span style="color: #666666;">:<span style="color: black;">\HadoopFeaturePackSetup\HadoopFeaturePackSetupTools </span></span></span></p>
<p>3. Install HadoopWebApi:</p>
<p><span style="color: black; font-family: Courier New; font-size: 10pt;">.\winpkg.ps1 ..\Packages\HadoopWebApi-winpkg.zip install -CredentialFilePath c:\Hadoop\Singlenodecreds.xml </span></p>
<p>4. Install the dashboard:</p>
<p><span style="color: black; font-family: Courier New;">.\winpkg.ps1 ..\Packages\HadoopDashboard-winpkg.zip install -CredentialFilePath c:\Hadoop\Singlenodecreds.xml </span></p>
<p>This should create the shortcuts on the desktop and you should be able to navigate to http://localhost:8085 to access the dashboard.</p>
<p><img loading="lazy" decoding="async" class="alignnone wp-image-2160 size-full" src="/wp-content/uploads/2012/11/110112_0205_InstallingH1.png" alt="110112_0205_InstallingH1" width="397" height="341" srcset="https://prologika.com/wp-content/uploads/2012/11/110112_0205_InstallingH1.png 397w, https://prologika.com/wp-content/uploads/2012/11/110112_0205_InstallingH1-300x258.png 300w" sizes="auto, (max-width: 397px) 100vw, 397px" /></p>
<p>From here, you can open the Interactive Console and your experience should be the same as with the Windows Azure HDInsight Service. David Zhang has <a href="http://www.youtube.com/watch?v=alPMYcomUEs">great coverage</a> of how you can use the Interactive Console in his video presentation &#8220;Introduction to the Hadoop on Azure Interactive JavaScript Console&#8221;.</p>
<p>BTW, HDInsight Server installs a set of Windows services corresponding to the daemons that run when Hadoop is installed on UNIX.</p>
<p><img loading="lazy" decoding="async" class="alignnone wp-image-2161 size-full" src="/wp-content/uploads/2012/11/110112_0205_InstallingH2.png" alt="110112_0205_InstallingH2" width="250" height="193" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://prologika.com/installing-hdinsight-server-for-windows/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Hadoop and Big Data Tonight with Atlanta BI Group</title>
		<link>https://prologika.com/hadoop-and-big-data-tonight-with-atlanta-bi-group/</link>
					<comments>https://prologika.com/hadoop-and-big-data-tonight-with-atlanta-bi-group/#respond</comments>
		
		<dc:creator><![CDATA[Prologika - Teo Lachev]]></dc:creator>
		<pubDate>Mon, 29 Oct 2012 19:46:15 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Hadoop]]></category>
		<guid isPermaLink="false">/CS/blogs/blog/archive/2012/10/29/hadoop-and-big-data-tonight-with-atlanta-bi-group.aspx</guid>

					<description><![CDATA[Atlanta BI Group is meeting tonight. The Topic is Hadoop and Big Data by Ketan Dave and our sponsor is Enterprise Software Solutions. With wide acceptance of open source technologies [&#8230;]]]></description>
					<content:encoded><![CDATA[<p>Atlanta BI Group is <a href="http://atlantabi.sqlpass.org/">meeting</a> tonight. The topic is Hadoop and Big Data, presented by Ketan Dave, and our sponsor is Enterprise Software Solutions.</p>
<p><em>With wide acceptance of open source technologies, Hadoop/MapReduce has become a viable option for implementing solutions that span hundreds of terabytes to petabytes of data. The scalability, reliability, versatility, and cost benefits of Hadoop-based systems are displacing the traditional approach to data solutions. Microsoft has partnered with Hadoop vendors and has recently made announcements to make data on Hadoop accessible from Excel, easily linked to SQL Server and its business intelligence, analytical, and reporting tools, and managed through Active Directory.<br />
</em></p>
<p>I hope you can make it!</p>
]]></content:encoded>
					
					<wfw:commentRss>https://prologika.com/hadoop-and-big-data-tonight-with-atlanta-bi-group/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>MS Guy Does Hadoop (Part 4 – Analyzing Data)</title>
		<link>https://prologika.com/ms-guy-does-hadoop-part-4-analyzing-data/</link>
					<comments>https://prologika.com/ms-guy-does-hadoop-part-4-analyzing-data/#respond</comments>
		
		<dc:creator><![CDATA[Prologika - Teo Lachev]]></dc:creator>
		<pubDate>Sat, 30 Jun 2012 21:32:00 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Hadoop]]></category>
		<guid isPermaLink="false">/CS/blogs/blog/archive/2012/06/30/ms-guy-does-hadoop-part-4-analyzing-data.aspx</guid>

					<description><![CDATA[In my previous blog, I talked about Hive. Hive provides a SQL-like layer on top of Hadoop so you don&#8217;t have write tons of MapReduce code to query Hadoop and [&#8230;]]]></description>
					<content:encoded><![CDATA[<p>In my previous <a href="/CS/blogs/blog/archive/2012/06/24/ms-guy-does-hadoop-part-3-hive.aspx">blog</a>, I talked about Hive. Hive provides a SQL-like layer on top of Hadoop so you don&#8217;t have to write tons of MapReduce code to query Hadoop and to aggregate and join data. To facilitate working with Hive, Microsoft introduced a <a href="http://social.technet.microsoft.com/wiki/contents/articles/6226.how-to-connect-excel-to-hadoop-on-azure-via-hiveodbc-en-us.aspx">Hive ODBC driver</a> (as of this writing, the driver is only available to Hadoop on Azure CTP subscribers). You can use this driver to connect to Hive running on Microsoft Azure or on your local Hadoop server. Denny Lee has <a href="http://dennyglee.com/2012/01/21/connecting-powerpivot-to-hadoop-on-azure-self-service-bi-to-big-data-in-the-cloud/">provided</a> detailed instructions on how to do the former. I&#8217;ll show you how to use it to connect to your local Hive server.</p>
<h1>Start the Hive Server</h1>
<p>If you use the Cloudera VM, the Hive server is not running by default. This service allows external clients to connect to Hive. To start it:</p>
<ol>
<li>Configure your Cloudera VM to obtain an IP address on your network. To do so in Oracle VirtualBox, go to the VM settings (Network tab) and change the network adapter to Bridged Adapter.</li>
</ol>
<p><img loading="lazy" decoding="async" class="alignnone wp-image-2223 size-full" src="/wp-content/uploads/2012/06/063012_2132_MSGuyDoesHa1.png" alt="063012_2132_MSGuyDoesHa1" width="468" height="350" srcset="https://prologika.com/wp-content/uploads/2012/06/063012_2132_MSGuyDoesHa1.png 468w, https://prologika.com/wp-content/uploads/2012/06/063012_2132_MSGuyDoesHa1-450x337.png 450w, https://prologika.com/wp-content/uploads/2012/06/063012_2132_MSGuyDoesHa1-300x224.png 300w" sizes="auto, (max-width: 468px) 100vw, 468px" /></p>
<ol>
<li>Start the Cloudera VM and open the command prompt.</li>
<li>Note the IP address assigned to the VM:</li>
</ol>
<p><span style="font-family: Courier New;">[cloudera@localhost ]$ ifconfig </span></p>
<p style="margin-left: 18pt;"><span style="font-family: Courier New;">[cloudera@localhost ~]$ ifconfig </span></p>
<p style="margin-left: 18pt;"><span style="font-family: Courier New;">eth0 Link encap:Ethernet HWaddr 08:00:27:A0:6C:DC </span></p>
<p style="margin-left: 18pt;"><span style="font-family: Courier New;">inet addr:192.168.1.111 Bcast:192.168.1.255 Mask:255.255.255.0 </span></p>
<p style="margin-left: 18pt;"><span style="font-family: Courier New;">UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 </span></p>
<p style="margin-left: 18pt;"><span style="font-family: Courier New;">RX packets:4320 errors:0 dropped:0 overruns:0 frame:0 </span></p>
<p style="margin-left: 18pt;"><span style="font-family: Courier New;">TX packets:2122 errors:0 dropped:0 overruns:0 carrier:0 </span></p>
<p style="margin-left: 18pt;"><span style="font-family: Courier New;">collisions:0 txqueuelen:1000 </span></p>
<p style="margin-left: 18pt;"><span style="font-family: Courier New;">RX bytes:3762720 (3.5 MiB) TX bytes:251411 (245.5 KiB)</span></p>
<p>3. If your host OS is Windows, edit the C:\Windows\System32\drivers\etc\hosts file and add an entry for that address, e.g.:</p>
<p><span style="font-family: Courier New;">192.168.1.111    cloudera </span></p>
<p>4. Ping the VM from the host OS to make sure it responds to the DNS name:</p>
<p><span style="font-family: Courier New;">C:&gt; ping cloudera </span></p>
<p>5.  Start the Hive server using this command:</p>
<p><span style="font-family: Courier New;">[cloudera@localhost ]$ hive &#8211;service hiveserver </span></p>
<p>By default, the Hive server listens on port 10000.</p>
<h1>Analyze Data in Excel</h1>
<p>There are two ways to bring Hive results in Excel and both options require the Hive ODBC driver:</p>
<ul>
<li>You can use the Hive Pane to import data. This option provides a basic user interface, called a Hive Pane, which is capable of auto-generating Hive queries.</li>
<li>Import Hive tables directly into PowerPivot for Excel.</li>
</ul>
<h2>Using the Hive Pane</h2>
<p>Once you install the Hive ODBC driver, you&#8217;ll get a new button in the Data ribbon group called Hive Pane.</p>
<ol>
<li>Click the Enter Cluster Details button. In the Host field, enter whatever name you specified in the host file (cloudera in my case). Note that the default port is set to 10000. Click OK. You shouldn&#8217;t see errors at this point.</li>
<li>Expand the Select the Hive Object to Query section and select a table. Select which columns you want to bring in. Optionally, specify criteria, aggregate grouping, and ordering. Notice that by default the driver brings in the first 200 rows, but you can use the Limit Rows section to override the default.</li>
<li>Click Execute Query to run the query and generate a table in Excel.</li>
<li>From there on, you can use the Excel native PivotTable and PivotChart reports to analyze data or link the data to PowerPivot.</li>
</ol>
<p><img loading="lazy" decoding="async" class="alignnone wp-image-2224 size-full" src="/wp-content/uploads/2012/06/063012_2132_MSGuyDoesHa2.png" alt="063012_2132_MSGuyDoesHa2" width="766" height="585" srcset="https://prologika.com/wp-content/uploads/2012/06/063012_2132_MSGuyDoesHa2.png 766w, https://prologika.com/wp-content/uploads/2012/06/063012_2132_MSGuyDoesHa2-450x344.png 450w, https://prologika.com/wp-content/uploads/2012/06/063012_2132_MSGuyDoesHa2-300x229.png 300w, https://prologika.com/wp-content/uploads/2012/06/063012_2132_MSGuyDoesHa2-705x538.png 705w" sizes="auto, (max-width: 766px) 100vw, 766px" /></p>
<h2>Importing Data in PowerPivot</h2>
<p>The second option is to bypass the Hive Pane and import a Hive table directly into PowerPivot. To do so, you need to set up a file data source first.</p>
<ol>
<li>In Windows, go to Administrative Tools and click Data Sources (ODBC).</li>
<li>In the ODBC Data Source Administrator, click the File DSN tab, and then click the Add button.</li>
<li>In the Create New Data Source dialog box, select the HIVE driver.</li>
</ol>
<p><img loading="lazy" decoding="async" class="alignnone wp-image-2226 size-full" src="/wp-content/uploads/2012/06/063012_2132_MSGuyDoesHa3.png" alt="063012_2132_MSGuyDoesHa3" width="499" height="399" srcset="https://prologika.com/wp-content/uploads/2012/06/063012_2132_MSGuyDoesHa3.png 499w, https://prologika.com/wp-content/uploads/2012/06/063012_2132_MSGuyDoesHa3-450x360.png 450w, https://prologika.com/wp-content/uploads/2012/06/063012_2132_MSGuyDoesHa3-300x240.png 300w" sizes="auto, (max-width: 499px) 100vw, 499px" /></p>
<ol>
<li>Click Next and save the file data source, such as in the C:\Users\Teo\Documents\My Data Sources folder. Ignore the warning that pops up.</li>
<li>Back in the ODBC Data Source Administrator (File DSN tab), browse to the folder where you saved the file data source, select it, and click Configure. That will bring you to the same ODBC Hive Setup dialog where you specify the Hadoop server name and port. Close the ODBC Data Source Administrator.</li>
<li>Back in Excel, click the PowerPivot ribbon tab, and then click the PowerPivot Window button.</li>
<li>In the PowerPivot Window Home tab, click the From Other Sources button in the Get External Data ribbon group.</li>
<li>In the Table Import Wizard, select the Others (OLEDB/ODBC) option, and then click Next.</li>
<li>In the Specify a Connection String, click the Build button to open the Data Link Properties.</li>
<li>
<div>Select the Provider tab and then select the Microsoft OLE DB Provider for ODBC Drivers.</div>
<p><img loading="lazy" decoding="async" class="alignnone wp-image-2227 size-full" src="/wp-content/uploads/2012/06/063012_2132_MSGuyDoesHa4.png" alt="063012_2132_MSGuyDoesHa4" width="377" height="473" srcset="https://prologika.com/wp-content/uploads/2012/06/063012_2132_MSGuyDoesHa4.png 377w, https://prologika.com/wp-content/uploads/2012/06/063012_2132_MSGuyDoesHa4-239x300.png 239w" sizes="auto, (max-width: 377px) 100vw, 377px" /></li>
<li>Select the Connection tab. Select the Use Connection String option, and then click the Build button.</li>
<li>In the Select Data Source dialog box, browse to the folder where you saved the file data source, select it, and then click OK to return back to the Data Link Properties.</li>
</ol>
<p><img loading="lazy" decoding="async" class="alignnone size-medium wp-image-2228" src="/wp-content/uploads/2012/06/063012_2132_MSGuyDoesHa5-300x200.png" alt="063012_2132_MSGuyDoesHa5" width="300" height="200" srcset="https://prologika.com/wp-content/uploads/2012/06/063012_2132_MSGuyDoesHa5-300x200.png 300w, https://prologika.com/wp-content/uploads/2012/06/063012_2132_MSGuyDoesHa5-450x300.png 450w, https://prologika.com/wp-content/uploads/2012/06/063012_2132_MSGuyDoesHa5.png 623w" sizes="auto, (max-width: 300px) 100vw, 300px" /></p>
<p>The Connection String field should now be populated with the following text:</p>
<p><span style="font-family: Courier New;">DRIVER={HIVE};Description=;HOST=cloudera;DATABASE=default;PORT=10000;FRAMED=0;AUTHENTICATION=0;AUTH_DATA=;UID=;PWD= </span></p>
<p>10. Click the Test Connection button to verify connectivity. Click OK to return to the Table Import Wizard, which should now have the following connection string:</p>
<p><span style="font-family: Courier New;">Provider=MSDASQL.1;Persist Security Info=False;Extended Properties=&#8221;DRIVER={HIVE};Description=;HOST=cloudera;DATABASE=default;PORT=10000;FRAMED=0;AUTHENTICATION=0;AUTH_DATA=;UID=;&#8221; </span></p>
<p>Follow the wizard to import the Hive tables as you would with any other data source.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://prologika.com/ms-guy-does-hadoop-part-4-analyzing-data/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>MS Guy Does Hadoop (Part 3 – Hive)</title>
		<link>https://prologika.com/ms-guy-does-hadoop-part-3-hive/</link>
					<comments>https://prologika.com/ms-guy-does-hadoop-part-3-hive/#respond</comments>
		
		<dc:creator><![CDATA[Prologika - Teo Lachev]]></dc:creator>
		<pubDate>Sun, 24 Jun 2012 23:54:00 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Hadoop]]></category>
		<guid isPermaLink="false">/CS/blogs/blog/archive/2012/06/24/ms-guy-does-hadoop-part-3-hive.aspx</guid>

					<description><![CDATA[Writing MapReduce Java jobs might be OK for simple analytical needs or distributing processing jobs but it might be challenging for more involved scenarios, such as joining two datasets. This [&#8230;]]]></description>
					<content:encoded><![CDATA[<p>Writing MapReduce Java jobs might be OK for simple analytical needs or for distributing processing jobs, but it might be challenging for more involved scenarios, such as joining two datasets. This is where Hive comes in. Hive was originally developed by the Facebook data warehousing team after they concluded that &#8220;… developers ended up spending hours (if not days) to write programs for even simple analyses&#8221;. Instead, Hive offers a SQL-like language that is capable of auto-generating the MapReduce code.</p>
<h1>The Hive Shell</h1>
<p>Hive introduces the notion of a &#8220;table&#8221; on top of data. It has its own shell which can be invoked by typing &#8220;hive&#8221; in the command window. The following command shows the Hive tables. I have defined two tables: <strong>records</strong> and <strong>records_ex</strong>.</p>
<p><span style="font-family: Courier New;">[cloudera@localhost book]$ hive </span></p>
<p><span style="font-family: Courier New;">hive&gt; show tables; </span></p>
<p><span style="font-family: Courier New;">OK </span></p>
<p><span style="font-family: Courier New;">records </span></p>
<p><span style="font-family: Courier New;">records_ex </span></p>
<p><span style="font-family: Courier New;">Time taken: 4.602 seconds </span></p>
<p><span style="font-family: Courier New;">hive&gt; </span></p>
<p>&nbsp;</p>
<h1>Creating a Managed Table</h1>
<p>Suppose you have a file with the following tab-delimited format:</p>
<p>1950    0    1</p>
<p>1950    22    1</p>
<p>1950    -11    1</p>
<p>1949    111    1</p>
<p>1949    78    1</p>
<p>&nbsp;</p>
<p>The following Hive statement creates a <strong>records</strong> table with three columns.</p>
<p><span style="font-family: Courier New; font-size: 10pt;">hive&gt; CREATE TABLE records (year STRING, temperature INT, quality INT) </span></p>
<p><span style="font-family: Courier New; font-size: 10pt;">ROW FORMAT DELIMITED </span></p>
<p><span style="font-family: Courier New; font-size: 10pt;">FIELDS TERMINATED BY &#8216;\t&#8217;; </span></p>
<p>Next, we use the LOAD DATA statement to populate the records table with data from a file located on the local file system:</p>
<p><span style="font-family: Courier New; font-size: 10pt;">LOAD DATA LOCAL INPATH &#8216;input/ncdc/micro-tab/sample.txt&#8217; </span></p>
<p><span style="font-family: Courier New; font-size: 10pt;">OVERWRITE INTO TABLE records; </span></p>
<p>This causes Hive to move the file to its repository on the local file system (/hive/warehouse). Therefore, by default, Hive will manage the table. If you drop the table, Hive will delete the source data.</p>
<h1>Creating an External Table</h1>
<p>What if the data is already in HDFS and you don&#8217;t want to move the files? In this case, you can tell Hive that the table will be external to Hive and you&#8217;ll manage the data. Suppose that you&#8217;ve already copied the sample.txt file to HDFS:</p>
<p><span style="font-family: Courier New; font-size: 10pt;">&gt;hive[cloudera@localhost ~]$ hadoop dfs -ls /user/cloudera/input/ncdc </span></p>
<p><span style="font-family: Courier New; font-size: 10pt;">Found 1 items </span></p>
<p><span style="font-family: Courier New; font-size: 10pt;">-rw-r&#8211;r&#8211; 1 cloudera supergroup 529 2012-06-07 16:24 /user/cloudera/input/ncdc/sample.txt </span></p>
<p>Next, we tell Hive to create an external table:</p>
<p><span style="font-family: Courier New; font-size: 10pt;">CREATE <strong>EXTERNAL</strong> TABLE records_ex (year STRING, temperature INT, quality INT) </span></p>
<p><span style="font-family: Courier New; font-size: 10pt;">LOCATION &#8216;/user/cloudera/records_ex&#8217;; </span></p>
<p><span style="font-family: Courier New; font-size: 10pt;">LOAD DATA INPATH &#8216;/input/ncdc/sample.txt&#8217; </span></p>
<p><span style="font-family: Courier New; font-size: 10pt;">OVERWRITE INTO TABLE records_ex </span></p>
<p>The EXTERNAL clause causes Hive to leave the data where it is without even checking if the file exists. The INPATH clause points to the source file. The OVERWRITE clause causes the existing data to be removed.</p>
<p><span style="color: #17365d; font-size: 15pt;"><strong>Querying Data </strong></span></p>
<p>The Hive SQL variant is called <a href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual">HiveQL</a>. HiveQL does not support the full SQL-92 specification, as this wasn&#8217;t a design goal. The first example below selects all data from our table; the second aggregates the maximum temperature by year.</p>
<p><span style="font-family: Courier New;">hive&gt; select * from records_ex; </span></p>
<p><span style="font-family: Courier New;">OK </span></p>
<p><span style="font-family: Courier New;">1950 0 1 </span></p>
<p><span style="font-family: Courier New;">1950 22 1 </span></p>
<p><span style="font-family: Courier New;">1950 -11 1 </span></p>
<p><span style="font-family: Courier New;">1949 111 1 </span></p>
<p><span style="font-family: Courier New;">1949 78 1 </span></p>
<p><span style="font-family: Courier New;">Time taken: 0.527 seconds </span></p>
<p><span style="font-family: Courier New;">hive&gt; <strong>SELECT year, MAX(temperature)</strong> </span></p>
<p><span style="font-family: Courier New;">&gt; <strong>FROM records</strong> </span></p>
<p><span style="font-family: Courier New;">&gt; <strong>WHERE temperature != 9999</strong> </span></p>
<p><span style="font-family: Courier New;">&gt; <strong>AND quality in (1,2)</strong> </span></p>
<p><span style="font-family: Courier New;">&gt; <strong>GROUP BY year;</strong> </span></p>
<p><span style="font-family: Courier New;">Total MapReduce jobs = 1 </span></p>
<p><span style="font-family: Courier New;">Launching Job 1 out of 1 </span></p>
<p><span style="font-family: Courier New;">Number of reduce tasks not specified. Estimated from input data size: 1 </span></p>
<p><span style="font-family: Courier New;">In order to change the average load for a reducer (in bytes): </span></p>
<p><span style="font-family: Courier New;">set hive.exec.reducers.bytes.per.reducer=&lt;number&gt; </span></p>
<p><span style="font-family: Courier New;">In order to limit the maximum number of reducers: </span></p>
<p><span style="font-family: Courier New;">set hive.exec.reducers.max=&lt;number&gt; </span></p>
<p><span style="font-family: Courier New;">In order to set a constant number of reducers: </span></p>
<p><span style="font-family: Courier New;">set mapred.reduce.tasks=&lt;number&gt; </span></p>
<p><span style="font-family: Courier New;">Starting Job = job_201206241704_0001, Tracking URL = http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201206241704_0001 </span></p>
<p><span style="font-family: Courier New;">Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=0.0.0.0:8021 -kill job_201206241704_0001 </span></p>
<p><span style="font-family: Courier New;">2012-06-24 18:21:15,022 Stage-1 map = 0%, reduce = 0% </span></p>
<p><span style="font-family: Courier New;">2012-06-24 18:21:19,106 Stage-1 map = 100%, reduce = 0% </span></p>
<p><span style="font-family: Courier New;">2012-06-24 18:21:30,212 Stage-1 map = 100%, reduce = 100% </span></p>
<p><span style="font-family: Courier New;">Ended Job = job_201206241704_0001 </span></p>
<p><span style="font-family: Courier New;">OK </span></p>
<p><span style="font-family: Courier New;">1949 111 </span></p>
<p><span style="font-family: Courier New;">1950 22 </span></p>
<p><span style="font-family: Courier New;">Time taken: 26.779 seconds </span></p>
<p>As you can see from the second example, Hive generates a MapReduce job. Please don&#8217;t draw any conclusions from the fact that this simple query takes 26 seconds on my VM while it would take a millisecond to execute on any modern relational database. It takes quite a bit of time to instantiate MapReduce jobs, and end users probably won&#8217;t query Hadoop directly anyway. Besides, the performance results will probably look completely different with hundreds of terabytes of data.</p>
<p>In a future blog on Hadoop, I plan to summarize my research on Hadoop and recommend usage scenarios.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://prologika.com/ms-guy-does-hadoop-part-3-hive/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>MS BI Guy Does Hadoop (Part 2 – Taking Hadoop for a Spin)</title>
		<link>https://prologika.com/ms-bi-guy-does-hadoop-part-2-taking-hadoop-for-a-spin/</link>
					<comments>https://prologika.com/ms-bi-guy-does-hadoop-part-2-taking-hadoop-for-a-spin/#respond</comments>
		
		<dc:creator><![CDATA[Prologika - Teo Lachev]]></dc:creator>
		<pubDate>Sun, 10 Jun 2012 00:50:00 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Hadoop]]></category>
		<guid isPermaLink="false">/CS/blogs/blog/archive/2012/06/09/ms-bi-guy-does-hadoop-part-2-taking-hadoop-for-a-spin.aspx</guid>

					<description><![CDATA[In part 1 of my Hadoop adventures, I walked you through the steps of setting the Cloudera virtual machine, which comes with CentOS and Hadoop preinstalled. Now, I&#8217;ll go through [&#8230;]]]></description>
					<content:encoded><![CDATA[<p>In <a href="/CS/blogs/blog/archive/2012/06/03/ms-bi-guy-does-hadoop-part-1-getting-started.aspx">part 1</a> of my Hadoop adventures, I walked you through the steps of setting up the Cloudera virtual machine, which comes with CentOS and Hadoop preinstalled. Now, I&#8217;ll go through the steps to run a small Hadoop program for analyzing weather data. The program and the code samples can be downloaded from the <a href="https://github.com/tomwhite/hadoop-book/tree/3e/">source code</a> that accompanies the book Hadoop: The Definitive Guide (3<sup>rd</sup> Edition) by Tom White. Again, the point of this exercise is to benefit Windows users who aren&#8217;t familiar with Unix but are willing to evaluate Hadoop in a Unix environment.</p>
<h1>Downloading the Source Code</h1>
<p>Let&#8217;s start by downloading the book source and the sample dataset:</p>
<ol>
<li>Start the Cloudera VM, log in, open the File Manager, and create a folder <strong>downloads</strong> as a subfolder of the cloudera folder (this is your home folder because you log in to CentOS as user cloudera). Then, create a folder <strong>book</strong> under the <strong>downloads</strong> folder.</li>
<li>Open Firefox and navigate to the book <a href="https://github.com/tomwhite/hadoop-book/tree/3e/">source code</a> page, and click the Zip button. Then, save the file to the <strong>book</strong> folder.</li>
<li>
<div>Open the File Manager and navigate to the /cloudera/downloads folder. Right-click the <strong>book</strong> folder and click Open Terminal Here. Enter the following command to extract the file:<br />
<span style="font-family: Courier New;">[cloudera@localhost]$ unzip tomwhite-hadoop-book-3e-draft-6-gc5b14af.zip</span></div>
</li>
<li>
<div>Unzipping the file creates a folder tomwhite-hadoop-book-c5b14af and extracts the files into it. To minimize folder nesting, use the File Manager to navigate to the /book/tomwhite-hadoop-book-c5b14af folder, press Ctrl+A to select all files, and copy and paste them into the /cloudera/downloads/book folder. You can then delete the tomwhite-hadoop-book-c5b14af folder.</div>
<p><img loading="lazy" decoding="async" class="alignnone wp-image-2232 size-full" src="/wp-content/uploads/2012/06/061012_0049_MSBIGuyDoes1.png" alt="061012_0049_MSBIGuyDoes1" width="645" height="426" srcset="https://prologika.com/wp-content/uploads/2012/06/061012_0049_MSBIGuyDoes1.png 645w, https://prologika.com/wp-content/uploads/2012/06/061012_0049_MSBIGuyDoes1-450x297.png 450w, https://prologika.com/wp-content/uploads/2012/06/061012_0049_MSBIGuyDoes1-300x198.png 300w" sizes="auto, (max-width: 645px) 100vw, 645px" /></li>
</ol>
<h1>Building the Source Code</h1>
<p>Next, you need to compile the source code and build the Java JAR files for the book samples.</p>
<p style="background: #d9d9d9;"><strong>Tip</strong> I failed to build the entire source code from the first try because my virtual machine ran out of memory when building the ch15 code. Therefore, before building the source, increase the memory of the Cloudera VM to 3 GB.</p>
<ol>
<li>Download and <a href="http://maven.apache.org/download.html">install</a> Maven. Think of Maven as MSBUILD. You might also find the following <a href="http://pwong-tipsandtricks.blogspot.com/2009/02/install-and-test-maven-on-centos-52.html">instructions</a> helpful for installing Maven.</li>
<li>
<div>Open the Terminal window (command prompt) and create the following environment variables so you don&#8217;t have to reference directly the Hadoop version and folder where Hadoop is installed:</div>
<p><span style="font-family: Courier New;">[cloudera@localhost]$ export HADOOP_HOME=/usr/lib/hadoop-0.20 </span></p>
<p><span style="font-family: Courier New;">[cloudera@localhost]$ export HADOOP_VERSION=0.20.2-cdh3u4</span></li>
<li>
<div>In the terminal window, navigate to the /cloudera/downloads/book folder and build the book source code with Maven using the following command. If the command is successful, it shows a summary that all projects were built successfully and places a file <strong>hadoop-examples.jar</strong> in the book folder.</div>
<p><span style="font-family: Courier New;">[cloudera@localhost book] $ mvn package -DskipTests -Dhadoop.version=1.0.2</span></li>
</ol>
<ol>
<li>
<div>Next, copy the input dataset with the weather data that Hadoop will analyze. For testing purposes, we&#8217;ll use a very small dataset that represents the weather datasets kept by the <a href="http://www.ncdc.noaa.gov/oa/ncdc.html">National Climatic Data Center (NCDC)</a>. Our task is to parse the files in order to obtain the maximum temperature per year. The mkdir command creates a /user/cloudera/input/ncdc folder in the Hadoop file system (HDFS). Next, we copy the file from the local file system to HDFS using <span style="font-family: Courier New;">put</span>.</div>
<p><span style="font-family: Courier New;">[cloudera@localhost book]$ su root </span></p>
<p><span style="font-family: Courier New;">[root@localhost book]# /usr/bin/hadoop dfs -mkdir /user/cloudera/input/ncdc </span></p>
<p><span style="font-family: Courier New;">[root@localhost book]# /usr/bin/hadoop dfs -put ./input/ncdc/sample.txt /user/cloudera/input/ncdc </span></p>
<p><span style="font-family: Courier New;">hadoop dfs -ls /user/cloudera/input/ncdc </span></p>
<p><span style="font-family: Courier New;">-rw-r&#8211;r&#8211; 1 cloudera supergroup 529 2012-06-07 16:24 /user/cloudera/input/ncdc/sample.txt</span></li>
</ol>
<p>The input file is a fixed-width file with the following content (I&#8217;ve highlighted the year and temperature sections).</p>
<p style="margin-left: 18pt;"><span style="font-family: Courier New;">006701199099999<strong>1950</strong>051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+<strong>00001</strong>+99999999999 </span></p>
<p style="margin-left: 18pt;"><span style="font-family: Courier New;">004301199099999<strong>1950</strong>051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+<strong>00221</strong>+99999999999 </span></p>
<p style="margin-left: 18pt;"><span style="font-family: Courier New;">004301199099999<strong>1950</strong>051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-<strong>00111</strong>+99999999999 </span></p>
<p style="margin-left: 18pt;"><span style="font-family: Courier New;">004301265099999<strong>1949</strong>032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+<strong>01111</strong>+99999999999 </span></p>
<p style="margin-left: 18pt;"><span style="font-family: Courier New;">004301265099999<strong>1949</strong>032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+<strong>00781</strong>+99999999999 </span></p>
<h1>Analyzing Data</h1>
<p>Now, it&#8217;s time to run the code sample and analyze the weather data.</p>
<ol>
<li>Run the MaxTemperature application.</li>
</ol>
<p><span style="font-family: Courier New;">[root@localhost book]# /usr/bin/hadoop MaxTemperature input/ncdc/sample.txt output </span></p>
<p><span style="font-family: Courier New;">[cloudera@localhost book]$ hadoop MaxTemperature input/ncdc/sample.txt output </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:25:44 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:25:44 INFO input.FileInputFormat: Total input paths to process : 1 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:25:44 WARN snappy.LoadSnappy: Snappy native library is available </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:25:44 INFO util.NativeCodeLoader: Loaded the native-hadoop library </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:25:44 INFO snappy.LoadSnappy: Snappy native library loaded </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:25:45 INFO mapred.JobClient: Running job: job_201206071457_0008 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:25:46 INFO mapred.JobClient: map 0% reduce 0% </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:25:54 INFO mapred.JobClient: map 100% reduce 0% </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:05 INFO mapred.JobClient: map 100% reduce 100% </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: Job complete: job_201206071457_0008 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: Counters: 26 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: Job Counters </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: Launched reduce tasks=1 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=8493 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: Launched map tasks=1 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: Data-local map tasks=1 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=10370 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: FileSystemCounters </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: FILE_BYTES_READ=61 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: HDFS_BYTES_READ=644 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: FILE_BYTES_WRITTEN=113206 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=17 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: Map-Reduce Framework </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: Map input records=5 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: Reduce shuffle bytes=61 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: Spilled Records=10 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: Map output bytes=45 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: CPU time spent (ms)=1880 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: Total committed heap usage (bytes)=196022272 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: Combine input records=0 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: SPLIT_RAW_BYTES=115 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: Reduce input records=5 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: Reduce input groups=2 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: Combine output records=0 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: Physical memory (bytes) snapshot=236310528 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: Reduce output records=2 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1078792192 </span></p>
<p><span style="font-family: Courier New;">12/06/07 16:26:06 INFO mapred.JobClient: Map output records=5</span></p>
<ol>
<li>Hadoop generates an output file (part-r-00000) that includes the job results, which we can see by browsing HDFS:</li>
</ol>
<p><span style="font-family: Courier New;">[root@localhost book]# hadoop dfs -ls /user/cloudera/output </span></p>
<p><span style="font-family: Courier New;">Found 3 items </span></p>
<p><span style="font-family: Courier New;">-rw-r&#8211;r&#8211; 1 cloudera supergroup 0 2012-06-07 16:26 /user/cloudera/output/_SUCCESS </span></p>
<p><span style="font-family: Courier New;">drwxr-xr-x &#8211; cloudera supergroup 0 2012-06-07 16:25 /user/cloudera/output/_logs </span></p>
<p><span style="font-family: Courier New;">-rw-r&#8211;r&#8211; 1 cloudera supergroup 17 2012-06-07 16:26 /user/cloudera/output/part-r-00000 </span></p>
<ol>
<li>Browse the content of the file:</li>
</ol>
<p><span style="font-family: Courier New;">[root@localhost book]# hadoop dfs -cat /user/cloudera/output/part-r-00000 </span></p>
<p style="margin-left: 18pt;"><strong>1949 111</strong> # the max temperature for 1949 was 11.1 Celsius</p>
<p style="margin-left: 18pt;"><strong>1950 22 </strong># the max temperature for 1950 was 2.2 Celsius</p>
<h1>Understanding the Map Job</h1>
<p>The book provides a detailed explanation of the source code. In a nutshell, the programmer has to implement:</p>
<ol>
<li>A Map job</li>
<li>(Optional) a Reduce job – You don&#8217;t need a Reduce job when there is no need to merge the map results, such as when processing can be carried out entirely in parallel (see my note below).</li>
<li>An application that ties the Mapper and the Reducer together.</li>
</ol>
<p style="background: #d9d9d9;"><strong>Note</strong> What I learned from the book is that Hadoop is not just about analyzing data. There is nothing stopping you to write a Reduce job that does some kind of processing to take advantage of the distributed computing capabilities of Hadoop. For example, the New York Times used Amazon&#8217;s EC2 compute cloud and Hadoop to process four terabytes of scanned public articles and convert them to PDFs. For more information, read the <a href="http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/">&#8220;Self-Service, Prorated Supercomputing Fun!&#8221;</a> article by Derek Gottfrid.</p>
<p>The Java code of the Map class is shown below.</p>
<p><span style="font-family: Courier New;">import java.io.IOException; </span></p>
<p><span style="font-family: Courier New;">import org.apache.hadoop.io.IntWritable; </span></p>
<p><span style="font-family: Courier New;">import org.apache.hadoop.io.LongWritable; </span></p>
<p><span style="font-family: Courier New;">import org.apache.hadoop.io.Text; </span></p>
<p><span style="font-family: Courier New;">import org.apache.hadoop.mapreduce.Mapper; </span></p>
<p><span style="font-family: Courier New;">public class MaxTemperatureMapper </span></p>
<p><span style="font-family: Courier New;">extends Mapper&lt;LongWritable, Text, Text, IntWritable&gt; { </span></p>
<p><span style="font-family: Courier New;">private static final int MISSING = 9999;</span></p>
<p><span style="font-family: Courier New;">@Override </span></p>
<p><span style="font-family: Courier New;">public void map(LongWritable key, Text value, Context context) </span></p>
<p><span style="font-family: Courier New;">throws IOException, InterruptedException { </span></p>
<p><span style="font-family: Courier New;">String line = value.toString(); </span></p>
<p><span style="font-family: Courier New;"><strong>String year = line.substring(15, 19); </strong></span></p>
<p><span style="font-family: Courier New;">int airTemperature; </span></p>
<p><span style="font-family: Courier New;">if (line.charAt(87) == &#8216;+&#8217;) { // parseInt doesn&#8217;t like leading plus signs </span></p>
<p><span style="font-family: Courier New;"><strong>airTemperature = Integer.parseInt(line.substring(88, 92)); </strong></span></p>
<p><span style="font-family: Courier New;">} else { </span></p>
<p><span style="font-family: Courier New;"><strong>airTemperature = Integer.parseInt(line.substring(87, 92)); </strong></span></p>
<p><span style="font-family: Courier New;">} </span></p>
<p><span style="font-family: Courier New;">String quality = line.substring(92, 93); </span></p>
<p><span style="font-family: Courier New;">if (airTemperature != MISSING &amp;&amp; quality.matches(&#8220;[01459]&#8221;)) { </span></p>
<p><span style="font-family: Courier New;">context.write(new Text(year), new IntWritable(airTemperature)); </span></p>
<p><span style="font-family: Courier New;">} </span><span style="font-family: Courier New;">} </span><span style="font-family: Courier New;">} </span></p>
<p>The code simply parses the input line by line to extract the year and the temperature reading from the fixed-width format. So, no surprises here. Imagine you&#8217;re an ETL developer who decides to parse a file in code instead of using the SSIS Flat File Source, which relies on a data provider to do the parsing for you. What&#8217;s interesting in Hadoop, however, is that the framework is intrinsically parallel and distributes the ETL job across multiple nodes. The map function extracts the year and the air temperature and writes them to the Context object. For the sample file, the map output is:</p>
<p><span style="font-family: Courier New;">(1950, 0) </span></p>
<p><span style="font-family: Courier New;">(1950, 22) </span></p>
<p><span style="font-family: Courier New;">(1950, </span><span style="font-family: Times New Roman;">−</span><span style="font-family: Courier New;">11) </span></p>
<p><span style="font-family: Courier New;">(1949, 111) </span></p>
<p><span style="font-family: Courier New;">(1949, 78) </span></p>
<p>Next, Hadoop sorts the map output and groups the values by key before handing them to the reducer. In this case, the year is the key and the temperature readings are the values:</p>
<p><span style="font-family: Courier New;">(1949, [111, 78]) </span></p>
<p><span style="font-family: Courier New;">(1950, [0, 22, -11])</span></p>
<h1>Understanding the Reduce Job</h1>
<h1><span style="font-size: 11pt;">The Reducer class is simple: </span></h1>
<p><span style="font-family: Courier New;">import java.io.IOException; </span></p>
<p><span style="font-family: Courier New;">import org.apache.hadoop.io.IntWritable; </span></p>
<p><span style="font-family: Courier New;">import org.apache.hadoop.io.Text; </span></p>
<p><span style="font-family: Courier New;">import org.apache.hadoop.mapreduce.Reducer; </span></p>
<p><span style="font-family: Courier New;">public class MaxTemperatureReducer </span></p>
<p><span style="font-family: Courier New;">extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt; { </span></p>
<p><span style="font-family: Courier New;">@Override </span></p>
<p><span style="font-family: Courier New;">public void reduce(Text key, Iterable&lt;IntWritable&gt; values, </span></p>
<p><span style="font-family: Courier New;">Context context) </span></p>
<p><span style="font-family: Courier New;">throws IOException, InterruptedException { </span></p>
<p><span style="font-family: Courier New;">int maxValue = Integer.MIN_VALUE; </span></p>
<p><span style="font-family: Courier New;">for (IntWritable value : values) { </span></p>
<p><span style="font-family: Courier New;">maxValue = Math.max(maxValue, value.get()); </span></p>
<p><span style="font-family: Courier New;">} </span></p>
<p><span style="font-family: Courier New;">context.write(key, new IntWritable(maxValue)); </span></p>
<p><span style="font-family: Courier New;">} </span></p>
<p><span style="font-family: Courier New;">} </span></p>
<p>For each key (year), the reduce function loops through the values (temperature readings) and emits the maximum temperature.</p>
<p><span style="color: #17365d; font-size: 15pt;"><strong>Understanding the Application </strong></span></p>
<p>Finally, you need a driver application that ties the Mapper and Reducer classes together and submits the job.</p>
<p><span style="font-family: Courier New;">import org.apache.hadoop.fs.Path; </span></p>
<p><span style="font-family: Courier New;">import org.apache.hadoop.io.IntWritable; </span></p>
<p><span style="font-family: Courier New;">import org.apache.hadoop.io.Text; </span></p>
<p><span style="font-family: Courier New;">import org.apache.hadoop.mapreduce.Job; </span></p>
<p><span style="font-family: Courier New;">import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; </span></p>
<p><span style="font-family: Courier New;">import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; </span></p>
<p><span style="font-family: Courier New;">public class MaxTemperature { </span></p>
<p><span style="font-family: Courier New;">public static void main(String[] args) throws Exception { </span></p>
<p><span style="font-family: Courier New;">if (args.length != 2) { </span></p>
<p><span style="font-family: Courier New;">System.err.println(&#8220;Usage: MaxTemperature &lt;input path&gt; &lt;output path&gt;&#8221;); </span></p>
<p><span style="font-family: Courier New;">System.exit(-1); </span></p>
<p><span style="font-family: Courier New;">} </span></p>
<p><span style="font-family: Courier New;">Job job = new Job(); </span></p>
<p><span style="font-family: Courier New;">job.setJarByClass(MaxTemperature.class); </span></p>
<p><span style="font-family: Courier New;">job.setJobName(&#8220;Max temperature&#8221;); </span></p>
<p><span style="font-family: Courier New;">FileInputFormat.addInputPath(job, new Path(args[0])); </span></p>
<p><span style="font-family: Courier New;">FileOutputFormat.setOutputPath(job, new Path(args[1])); </span></p>
<p><span style="font-family: Courier New;">job.setMapperClass(MaxTemperatureMapper.class); </span></p>
<p><span style="font-family: Courier New;">job.setReducerClass(MaxTemperatureReducer.class); </span></p>
<p><span style="font-family: Courier New;">job.setOutputKeyClass(Text.class); </span></p>
<p><span style="font-family: Courier New;">job.setOutputValueClass(IntWritable.class); </span></p>
<p><span style="font-family: Courier New;">System.exit(job.waitForCompletion(true) ? 0 : 1); </span></p>
<p><span style="font-family: Courier New;">} </span><span style="font-family: Courier New;">} </span></p>
<h1>Summary</h1>
<p>Although simple and unassuming, the MaxTemperature application demonstrates a few aspects of the Hadoop inner workings:</p>
<ol>
<li>You copy the input datasets (presumably huge files) to the Hadoop distributed file system (HDFS). Hadoop shreds the files into blocks (64 MB by default). Then, assuming the default triple replication, it stores each block three times: on the node where the command is executed and, in a multi-node cluster, on two additional nodes to provide fault tolerance. If a node fails, the file can still be reassembled from the surviving nodes.</li>
<li>The programmer writes Java code to implement a map job, a reduce job, and an application that invokes them.</li>
<li>The Hadoop framework parallelizes and distributes the jobs to move the MapReduce computation to each node hosting a part of the input dataset. Behind the scenes, Hadoop runs a JobTracker on the master node and TaskTrackers on the data nodes; the TaskTrackers execute the tasks and report progress back to the JobTracker. If a task fails, the JobTracker can reschedule it on a different TaskTracker.</li>
<li>Once the map tasks are done, their sorted outputs are transferred to the node(s) where the reduce task(s) run. The reducer merges the sorted outputs and writes the result to an output file stored in HDFS for reliability.</li>
<li>Hadoop is a batch processing system. Jobs are started, processed, and their output is written to disk.</li>
</ol>
]]></content:encoded>
					
					<wfw:commentRss>https://prologika.com/ms-bi-guy-does-hadoop-part-2-taking-hadoop-for-a-spin/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>MS BI Guy Does Hadoop (Part 1 – Getting Started)</title>
		<link>https://prologika.com/ms-bi-guy-does-hadoop-part-1-getting-started/</link>
					<comments>https://prologika.com/ms-bi-guy-does-hadoop-part-1-getting-started/#respond</comments>
		
		<dc:creator><![CDATA[Prologika - Teo Lachev]]></dc:creator>
		<pubDate>Mon, 04 Jun 2012 02:45:00 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Hadoop]]></category>
		<guid isPermaLink="false">/CS/blogs/blog/archive/2012/06/03/ms-bi-guy-does-hadoop-part-1-getting-started.aspx</guid>

					<description><![CDATA[With Big Data and Hadoop getting a lot of attention nowadays, I&#8217;ve decided it&#8217;s time to take a look so I am starting a log of my Hadoop adventures. I [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><span style="font-family: Times New Roman; font-size: 12pt;">With Big Data and Hadoop getting a lot of attention nowadays, I&#8217;ve decided it&#8217;s time to take a look so I am starting a log of my Hadoop adventures. I hope it&#8217;ll benefit Windows users especially BI pros. If not, at least I&#8217;ll keep a track of my experience, so I can recreate it if needed. Before I start, for an excellent introduction to Hadoop from a Microsoft perspective, watch the talk on <a href="http://www.sqlpass.org/summit/2011/Live/LiveStreaming/LiveStreamingFriday.aspx">Big Data &#8211; What is the Big Deal?</a> by David Dewitt. Previously, I&#8217;ve experimented and I got my feed wet with Apache Hadoop-based Services for Windows Azure, which is the Microsoft implementation for Hadoop in the cloud, but I was thirsty for more and wanted to dig deeper. Microsoft is currently working on CTP of <a href="http://social.technet.microsoft.com/wiki/contents/articles/6204.hadoop-based-services-for-windows-en-us.aspx">Hadoop-based Services For Windows</a>, which will provide a supported environment for installing and running Hadoop on Windows. While waiting, O&#8217;Reilly was king enough to send me a review copy of <a href="http://shop.oreilly.com/product/0636920021773.do">Hadoop – The Definitive Guide, 3<sup>rd</sup> Edition</a>, by Tom White. Since Hadoop is an open-source project, I had to rediscover and relearn something I thought I would never had to since my university days – Unix, or to be more precise its <a href="http://en.wikipedia.org/wiki/CentOS">CentOS</a> Linux variant which is installed on the Cloudera VM. So, part 1 is about setting up your environment. </span></p>
<p><span style="font-family: Times New Roman; font-size: 12pt;">From the book, I discovered that Cloudera has a virtual machine for Virtual Box. I have VirtualBox on my Windows 7 laptop so I could run SharePoint 2010 (available in x64 only). VirtualBox is a great piece of software that was originally developed by Sun Microsystems and currently owned by Oracle. So, I&#8217;ve decided to take the VM shortcut since I don&#8217;t have much time to mess around with Cygwin, Java, etc. After downloading and double-extracting the Cloudera file, I created a new VirtualBox machine and I&#8217;ve made the following changes. </span></p>
<p><img loading="lazy" decoding="async" class="alignnone wp-image-2236 size-full" src="/wp-content/uploads/2012/06/060412_0245_MSBIGuyDoes1.png" alt="060412_0245_MSBIGuyDoes1" width="484" height="315" srcset="https://prologika.com/wp-content/uploads/2012/06/060412_0245_MSBIGuyDoes1.png 484w, https://prologika.com/wp-content/uploads/2012/06/060412_0245_MSBIGuyDoes1-450x293.png 450w, https://prologika.com/wp-content/uploads/2012/06/060412_0245_MSBIGuyDoes1-300x195.png 300w" sizes="auto, (max-width: 484px) 100vw, 484px" /></p>
<p><span style="font-family: Times New Roman; font-size: 12pt;">On the next step, I increased the memory to 2GB (recommended by Cloudera). In the Virtual Hard Disk step, I chose the &#8220;Use existing hard disk&#8221; option and pointed to the vmdi file I extracted from the Cloudera downloadable. Then, in the Settings page for the new VM, I&#8217;ve changed the storage to use the IDE controller instead of SATA which Cloudera said that the VM might have an issue with. </span></p>
<p><img loading="lazy" decoding="async" class="alignnone size-medium wp-image-2238" src="/wp-content/uploads/2012/06/060412_0245_MSBIGuyDoes2-300x224.png" alt="060412_0245_MSBIGuyDoes2" width="300" height="224" srcset="https://prologika.com/wp-content/uploads/2012/06/060412_0245_MSBIGuyDoes2-300x224.png 300w, https://prologika.com/wp-content/uploads/2012/06/060412_0245_MSBIGuyDoes2-450x336.png 450w, https://prologika.com/wp-content/uploads/2012/06/060412_0245_MSBIGuyDoes2.png 482w" sizes="auto, (max-width: 300px) 100vw, 300px" /></p>
<p><span style="font-family: Times New Roman; font-size: 12pt;">Once this was done, I was able to start the VM, which automatically logged me into CentOS as user <strong>cloudera</strong>. The first challenge I had to overcome was installing the VirtualBox Guest Editions for Linux in order to be able to resize the window and move the mouse cursor in and out without having to hold the right Ctrl key. This turned out to be more difficult than expected. The final solution took the following steps: </span></p>
<ol>
<li>Once you&#8217;ve started the guest OS, in the VM menu toolbar click Install Guest Additions to mount the disk.</li>
<li>
<div>Open the File Manager and navigate to the <span style="font-family: Courier New;">/etc/yum.repos.d </span>folder. Right-click the folder and click Open Terminal Here.</div>
<p><span style="font-family: Times New Roman; font-size: 12pt;">In the command window, type the following command to elevate your privileges: </span></p>
<p><span style="font-size: 12pt;"><span style="font-family: Courier New;">$ su </span></span></p>
<p><span style="font-family: Times New Roman; font-size: 12pt;">Enter the password (claudera) when prompted </span></li>
<li>
<div>Open the vi editor to edit the Cloudera-cdh3.repo file, as mentioned in the Cloudera VM demo notes, by typing this command:</div>
<p><span style="font-size: 12pt;"><span style="font-family: Courier New;">su -c 'vi Cloudera-cdh3.repo' </span></span></p></li>
<li>
<div>Change the baseurl line (changes in bold):</div>
<p><span style="font-size: 12pt;"><span style="font-family: Courier New;">[Cloudera-cdh3] </span></span></p>
<p><span style="font-size: 12pt;"><span style="font-family: Courier New;">name=Cloudera&#8217;s Distribution for Hadoop, Version 3 </span></span></p>
<p><span style="font-size: 12pt;"><span style="font-family: Courier New;">enabled=1 </span></span></p>
<p><span style="font-size: 12pt;"><span style="font-family: Courier New;">gpgcheck=0 </span></span></p>
<p><span style="font-size: 12pt;"><span style="font-family: Courier New;">baseurl=<strong>http://archive.cloudera.com/redhat/cdh/3u4/</strong> </span></span></li>
<li>
<div>Press ESC to go to command mode and type :wq to save and exit vi.</div>
<p style="background: #d9d9d9;"><strong>Tip</strong>: To edit files in a more civilized way, click the File Manager icon in the menu bar at the bottom of the shell. However, you won&#8217;t have access to save files. As a workaround, launch the File Manager with elevated permissions as follows:</p>
<p style="background: #d9d9d9;"><span style="font-family: Courier New; font-size: 12pt;">$ su –c Thunar </span></p>
</li>
<li>
<div>Enter the following command to install a few utilities and development kernel:</div>
<p><span style="font-size: 12pt;"><span style="font-family: Courier New;">$ yum install dkms binutils gcc make patch libgomp glibc-headers glibc-devel kernel-headers kernel-devel </span></span></li>
<li>
<div>Then navigate to the media folder and run the Guest Additions file.<br />
<span style="font-family: Courier New;">$ cd /media<br />
$ cd VBOXADDITIONS_4.1.16_78094<br />
$ ./VBoxLinuxAdditions.run</span></div>
<p><span style="font-size: 12pt;"><span style="font-family: Times New Roman;">This should install the guest additions successfully. If you see any error messages, execute additional packages with </span><span style="font-family: Courier New;">yum</span><span style="font-family: Times New Roman;"> as requested. </span></span></li>
</ol>
<p><span style="font-family: Times New Roman; font-size: 12pt;">Next, you can verify the Hadoop installation by executing the steps in the Starting Hadoop and Verifying it is Working Properly section in the Hadoop Quick Start Guide. </span></p>
]]></content:encoded>
					
					<wfw:commentRss>https://prologika.com/ms-bi-guy-does-hadoop-part-1-getting-started/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
