tag:blogger.com,1999:blog-72618095506470522802024-03-13T19:17:41.879+01:00DAPLab - Data Analysis and Processing LabReduces the entry barrier for companies to find value out of their data and ultimately turn into a data-driven companyVincenthttp://www.blogger.com/profile/02951964735917349036noreply@blogger.comBlogger3125tag:blogger.com,1999:blog-7261809550647052280.post-9477606630943243212015-04-24T12:50:00.002+02:002023-02-14T15:07:51.275+01:00Fault Tolerant Twitter firehose Ingestion on YARN
<div style="box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 16px; line-height: 25.6000003814697px; margin-bottom: 16px;">
<div style="box-sizing: border-box; line-height: 25.6000003814697px; margin-bottom: 16px;">
<a href="http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html" style="box-sizing: border-box; color: #4183c4; text-decoration: none;">YARN</a>, aka NextGen MapReduce, is awesome for building fault-tolerant distributed applications. But writing a <a href="http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html" style="box-sizing: border-box; color: #4183c4; text-decoration: none;">plain YARN application</a> is far from trivial and might even be a <a href="http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/yarn_either_it_is_really" style="box-sizing: border-box; color: #4183c4; text-decoration: none;">show-stopper for many engineers</a>.</div>
<div style="box-sizing: border-box; line-height: 25.6000003814697px; margin-bottom: 16px;">
The good news is that a framework to simplify interaction with YARN has emerged and joined the Apache Software Foundation: <a href="http://twill.incubator.apache.org/" style="box-sizing: border-box; color: #4183c4; text-decoration: none;">Apache Twill</a>. While still in the incubation phase, the project looks really promising and lets you write (easier to test) <a href="http://docs.oracle.com/javase/8/docs/api/java/lang/Runnable.html" style="box-sizing: border-box; color: #4183c4; text-decoration: none;">Runnable</a> applications and run them on YARN.</div>
<span style="line-height: 25.6000003814697px;">As part of the <a href="http://daplab.ch/">DAPLAB</a> Hacky Thursday, we jumped head first into Twill, <a href="https://github.com/ReactiveX/RxJava">RxJava</a> and <a href="http://twitter4j.org/">Twitter4j</a>, all bundled together to build a fault tolerant Twitter firehose ingestion application storing the tweets into HDFS.</span><br />
<span style="line-height: 25.6000003814697px;"><br /></span>
<span style="line-height: 25.6000003814697px;">We used Twill </span><a href="http://twill.incubator.apache.org/apidocs-0.5.0-incubating/index.html" style="box-sizing: border-box; color: #4183c4; line-height: 25.6000003814697px; outline: 0px;">version 0.5.0-incubating</a><span style="line-height: 25.6000003814697px;">. </span><span style="line-height: 25.6000003814697px;">Read more on Twill</span><span style="line-height: 25.6000003814697px;"> </span><a href="http://blog.cask.co/2014/01/programming-with-apache-twill-part-ii/" style="box-sizing: border-box; color: #4183c4; line-height: 25.6000003814697px; text-decoration: none;">here</a><span style="line-height: 25.6000003814697px;">,</span><span style="line-height: 25.6000003814697px;"> </span><a href="http://blog.cask.co/2013/06/simplifying-yarn-introducing-weave-to-the-apache/" style="box-sizing: border-box; color: #4183c4; line-height: 25.6000003814697px; text-decoration: none;">here</a><span style="line-height: 25.6000003814697px;"> </span><span style="line-height: 25.6000003814697px;">and </span><a href="http://jaxenter.com/developing-distributed-applications-with-apache-twill-107728.html" style="box-sizing: border-box; color: #4183c4; line-height: 25.6000003814697px; text-decoration: none;">here</a><span style="line-height: 25.6000003814697px;">.</span><br />
<br />
<span style="line-height: 25.6000003814697px;">Twitter4j has been wrapped as an </span><a href="http://reactivex.io/RxJava/javadoc/rx/Observable.OnSubscribe.html" style="box-sizing: border-box; color: #4183c4; line-height: 25.6000003814697px; text-decoration: none;">RxJava Observable</a><span style="line-height: 25.6000003814697px;">, and is attached to an HDFS sink, partitioning the data by </span><span style="background-color: #f7f7f7; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 13.6000003814697px; line-height: 1.45;">year/month/day/hour/minute</span><span style="line-height: 25.6000003814697px;">. This will be useful to create <a href="http://hive.apache.org/">Hive</a> tables later on, with proper partitions.</span><br />
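<div>
To make the partitioning scheme concrete, here is a minimal shell sketch of how a tweet timestamp could map to a minute-level file path. It is not taken from the project: <code>partition_path</code> is a hypothetical helper, it assumes GNU <code>date</code>, and it assumes that partitions are computed in UTC under the default <code>/tmp/twitter/firehose</code> folder.</div>

```shell
#!/bin/sh
# Hypothetical helper: map a Unix timestamp to a minute-level partition path.
# Assumptions: GNU date (the -d @EPOCH form) and UTC-based partitions.
partition_path() {
    date -u -d "@$1" +"/tmp/twitter/firehose/%Y/%m/%d/%H/%M.json"
}

# 1429862280 is 2015-04-24 07:58:00 UTC
partition_path 1429862280
```

<div>
Every tweet received during the same minute ends up in the same file, which is what lets a Hive table be partitioned by year, month, day and hour later on.</div>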
<h3 style="box-sizing: border-box; font-size: 1.5em; line-height: 1.43; margin-bottom: 16px; margin-top: 1em; position: relative;">
Check it out</h3>
<div>
The sources of the project are available on GitHub: <a href="https://github.com/daplab/yarn-starter">https://github.com/daplab/yarn-starter</a></div>
<div>
<pre style="background-color: #f7f7f7; border-radius: 3px; box-sizing: border-box; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 13.6000003814697px; font-stretch: normal; line-height: 1.45; margin-bottom: 16px; overflow: auto; padding: 16px; word-wrap: normal;"><code style="background: transparent; border-radius: 3px; border: 0px; box-sizing: border-box; display: inline; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 13.6000003814697px; line-height: inherit; margin: 0px; max-width: initial; overflow: initial; padding: 0px; word-break: normal; word-wrap: normal;">git clone https://github.com/daplab/yarn-starter.git</code></pre>
</div>
</div>
<h3 style="box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 1.5em; line-height: 1.43; margin-bottom: 16px; margin-top: 1em; position: relative;">
<a aria-hidden="true" class="anchor" href="https://github.com/daplab/yarn-starter#configure-it" id="user-content-configure-it" style="box-sizing: border-box; color: #4183c4; display: block; left: 0px; line-height: 1.2; margin-left: -30px; padding-left: 30px; padding-right: 6px; position: absolute; text-decoration: none; top: 0px;"></a>Configure it</h3>
<div style="box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 16px; line-height: 25.6000003814697px; margin-bottom: 16px;">
The Twitter keys and secrets are currently hardcoded in <a href="https://github.com/daplab/yarn-starter/blob/master/src/main/java/ch/daplab/yarn/twitter/rx/TwitterObservable.java" style="box-sizing: border-box; color: #4183c4; text-decoration: none;"><code style="background-color: rgba(0, 0, 0, 0.0392157); border-radius: 3px; box-sizing: border-box; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 13.6000003814697px; margin: 0px; padding: 0.2em 0px;">TwitterObservable.java</code></a> (yeah, it's in the <a href="https://github.com/daplab/yarn-starter/issues/1" style="box-sizing: border-box; color: #4183c4; text-decoration: none;">TODO list</a> :)). Please set them there <em style="box-sizing: border-box;">before</em> building.</div>
<h3 style="box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 1.5em; line-height: 1.43; margin-bottom: 16px; margin-top: 1em; position: relative;">
<a aria-hidden="true" class="anchor" href="https://github.com/daplab/yarn-starter#build-it" id="user-content-build-it" style="box-sizing: border-box; color: #4183c4; display: block; left: 0px; line-height: 1.2; margin-left: -30px; padding-left: 8px; padding-right: 6px; position: absolute; text-decoration: none; top: 0px;"><span class="octicon octicon-link" style="-webkit-font-smoothing: antialiased; -webkit-user-select: none; box-sizing: border-box; color: black; display: inline-block; font-family: octicons; font-size: 16px; font-stretch: normal; font-weight: normal; line-height: 1; text-rendering: auto; vertical-align: middle;"></span></a>Build it</h3>
<pre style="background-color: #f7f7f7; border-radius: 3px; box-sizing: border-box; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 13.6000003814697px; font-stretch: normal; line-height: 1.45; margin-bottom: 16px; overflow: auto; padding: 16px; word-wrap: normal;"><code style="background: transparent; border-radius: 3px; border: 0px; box-sizing: border-box; display: inline; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 13.6000003814697px; line-height: inherit; margin: 0px; max-width: initial; overflow: initial; padding: 0px; word-break: normal; word-wrap: normal;">mvn clean install
</code></pre>
<h3 style="box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 1.5em; line-height: 1.43; margin-bottom: 16px; margin-top: 1em; position: relative;">
<a aria-hidden="true" class="anchor" href="https://github.com/daplab/yarn-starter#run-it" id="user-content-run-it" style="box-sizing: border-box; color: #4183c4; display: block; left: 0px; line-height: 1.2; margin-left: -30px; padding-left: 30px; padding-right: 6px; position: absolute; text-decoration: none; top: 0px;"></a>Run it</h3>
<div style="box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 16px; line-height: 25.6000003814697px; margin-bottom: 16px;">
Run it on the <a href="http://daplab.ch/" style="box-sizing: border-box; color: #4183c4; text-decoration: none;">DAPLAB</a> infrastructure like this:</div>
<pre style="background-color: #f7f7f7; border-radius: 3px; box-sizing: border-box; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 13.6000003814697px; font-stretch: normal; line-height: 1.45; margin-bottom: 16px; overflow: auto; padding: 16px; word-wrap: normal;"><code style="background: transparent; border-radius: 3px; border: 0px; box-sizing: border-box; display: inline; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 13.6000003814697px; line-height: inherit; margin: 0px; max-width: initial; overflow: initial; padding: 0px; word-break: normal; word-wrap: normal;">./src/main/scripts/start-twitter-ingestion-app.sh daplab-wn-22.fri.lan:2181
</code></pre>
<div style="box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 16px; line-height: 25.6000003814697px; margin-bottom: 16px;">
By default, data is stored under <code style="background-color: rgba(0, 0, 0, 0.0392157); border-radius: 3px; box-sizing: border-box; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 13.6000003814697px; margin: 0px; padding: 0.2em 0px;">/tmp/twitter/firehose</code>. To monitor the ingestion process:</div>
<pre style="background-color: #f7f7f7; border-radius: 3px; box-sizing: border-box; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 13.6000003814697px; font-stretch: normal; line-height: 1.45; margin-bottom: 16px; overflow: auto; padding: 16px; word-wrap: normal;"><code style="background: transparent; border-radius: 3px; border: 0px; box-sizing: border-box; display: inline; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 13.6000003814697px; line-height: inherit; margin: 0px; max-width: initial; overflow: initial; padding: 0px; word-break: normal; word-wrap: normal;">hdfs dfs -ls -R /tmp/twitter/firehose
...
-rw-r--r-- 3 yarn hdfs 7469136 2015-04-24 09:59 /tmp/twitter/firehose/2015/04/24/07/58.json
-rw-r--r-- 3 yarn hdfs 6958213 2015-04-24 10:00 /tmp/twitter/firehose/2015/04/24/07/59.json
drwxrwxrwx - yarn hdfs 0 2015-04-24 10:01 /tmp/twitter/firehose/2015/04/24/08
-rw-r--r-- 3 yarn hdfs 9444337 2015-04-24 10:01 /tmp/twitter/firehose/2015/04/24/08/00.json
...
</code></pre>
<div style="box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 16px; line-height: 25.6000003814697px;">
That's it, now you can kill the application and see how it will be restarted by YARN!</div>
Anonymoushttp://www.blogger.com/profile/09328084908582241808noreply@blogger.comtag:blogger.com,1999:blog-7261809550647052280.post-21128261939632848822015-04-23T13:33:00.001+02:002023-02-14T15:07:41.526+01:00Data Ingestion - Homogeneous meteorological dataHomogeneous monthly values of temperature and precipitation for 14 stations from 1864 until today. Yearly values are averaged over the whole of Switzerland since 1864.<br />
<div>
<br />
<h2>
Data set</h2>
<h3>
Explanation</h3>
<div>
<br /></div>
<div>
The file is a plain .txt file and starts with a four-line header. </div>
<div>
<pre class="console">MeteoSchweiz / MeteoSuisse / MeteoSvizzera / MeteoSwiss
stn|time|rhs150m0|ths200m0
||[mm]|[°C]</pre>
<br />
Fields are separated by a "|" and contain no whitespace.
</div>
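<div>
Since the header takes up the first four lines, it has to be skipped before the data can be parsed. One way to do that is <code>tail -n +5</code>, which prints from line 5 onward. The sketch below shows this on a throwaway file. The header lines are placeholders and <code>strip_header</code> is a hypothetical helper name.</div>

```shell
#!/bin/sh
# tail -n +5 prints from line 5 onward, i.e. it drops the first four lines.
# Demonstrated on a throwaway file with four placeholder header lines.
sample=$(mktemp)
printf '%s\n' 'header 1' 'header 2' 'header 3' 'header 4' \
    'BAS|201503|36.9|4.5' > "$sample"

strip_header() {
    tail -n +5 "$1"
}

strip_header "$sample"    # prints only the data row
rm -f "$sample"
```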
<div>
<br /></div>
<div>
<br /></div>
<div>
<table>
<thead>
<tr><th>#</th>
<th>Name</th>
<th>Example</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>stn</td>
<td>BAS</td>
<td><pre>Station names
BAS: Basel / Binningen
BER: Bern / Zollikofen
CHM: Chaumont
CHD: Château-d'Oex
GSB: Col du Grand St-Bernard
DAV: Davos
ENG: Engelberg
GVE: Genève-Cointrin
LUG: Lugano
PAY: Payerne
SIA: Segl-Maria
SIO: Sion
SAE: Säntis
SMA: Zürich / Fluntern</pre>
</td>
</tr>
<tr>
<td>2</td>
<td>time</td>
<td>201503</td>
<td>Year and month of the measurement. Format: yyyyMM</td>
</tr>
<tr>
<td>3</td>
<td>rhs150m0</td>
<td>36.9</td>
<td>Sum of precipitation in mm, measured at 1.5 meters</td>
</tr>
<tr>
<td>4</td>
<td>ths200m0</td>
<td>4.5</td>
<td>Mean temperature in degrees Celsius, measured at 2 meters</td>
</tr>
</tbody>
</table>
</div>
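<div>
As a quick illustration of the four columns, the following sketch splits one record with <code>awk</code>. The sample row is simply assembled from the example values in the table above, and <code>describe_record</code> is a hypothetical helper name.</div>

```shell
#!/bin/sh
# Split one pipe-separated record into its four named fields.
# The sample row is assembled from the example column of the table.
describe_record() {
    printf '%s\n' "$1" | awk -F'|' '{
        printf "station=%s month=%s precipitation_mm=%s temperature_c=%s\n", $1, $2, $3, $4
    }'
}

describe_record 'BAS|201503|36.9|4.5'
```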
<br />
<h3>
Data update</h3>
</div>
<div>
On OpenData, the file is regenerated every day; however, the data set itself only changes monthly. </div>
<div>
<br /></div>
<h3>
Data Access</h3>
<div>
<div lang="fr-CH" style="margin-bottom: 0in;">
<div style="line-height: 100%;">
<a href="http://data.geo.admin.ch.s3.amazonaws.com/ch.meteoschweiz.homogenereihen/VQAA60.txt">http://data.geo.admin.ch.s3.amazonaws.com/ch.meteoschweiz.homogenereihen/VQAA60.txt</a></div>
<div style="line-height: 100%;">
<br /></div>
<h3 style="line-height: 100%;">
Warnings</h3>
<div style="line-height: 100%;">
Some stations, such as Payerne, have missing data, and other stations did not exist in 1864. Consequently, the data must be filtered before being used for statistics.<br />
<br />
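<div>
A possible way to do this filtering locally is sketched below. How missing values are encoded in the file is an assumption here (empty field or a non-numeric placeholder such as "-"), so check the real file first. <code>filter_complete_rows</code> is a hypothetical helper and the sample rows are made up for illustration.</div>

```shell
#!/bin/sh
# Keep only rows whose precipitation and temperature fields are plain numbers.
# Assumption: missing values are empty or a non-numeric placeholder such as "-".
filter_complete_rows() {
    awk -F'|' '
        $3 ~ /^-?[0-9]+(\.[0-9]+)?$/ && $4 ~ /^-?[0-9]+(\.[0-9]+)?$/
    '
}

# Hypothetical sample: the PAY row has no measurements and is dropped
printf '%s\n' 'BAS|201503|36.9|4.5' 'PAY|186401|-|-' | filter_complete_rows
```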
<h2>
Hive</h2>
<h3>
<span style="line-height: 100%;">Creating a database, a table and loading data</span></h3>
</div>
<div>
Note: So that this tutorial works for everybody, it creates a database prefixed with your username (${env:USER} inside Hive).<br />
1. Download the data and remove the header<br />
<pre class="console">$ wget http://data.geo.admin.ch.s3.amazonaws.com/ch.meteoschweiz.homogenereihen/VQAA60.txt
$ tail -n +5 VQAA60.txt > VQAA60.txt.new && mv VQAA60.txt.new VQAA60.txt
</pre>
2. Copy the downloaded file into your home folder in HDFS (the trailing "." points to /user/$(whoami)). See <a href="http://daplabch.blogspot.ch/2015/03/hdfshelloworld.html" target="_blank">HDFSHelloWorld</a> if you're not familiar with HDFS.<br />
<pre class="console">$ hdfs dfs -copyFromLocal VQAA60.txt .</pre>
3. Create a database in Hive
<br />
<pre class="console">$ hive
create database ${env:USER}_meteo;</pre>
4. Create a temp table to load the file into
<br />
<pre class="console">$ hive
use ${env:USER}_meteo;
create table temp_meteo (col_value STRING);
LOAD DATA INPATH '/user/${env:USER}/VQAA60.txt' OVERWRITE INTO TABLE temp_meteo; </pre>
5. Create the final table and insert the data into it. The dates are in yyyyMM format and are stored as strings, because in the query language it is easier to extract the year from a string than from an int.
<br />
<pre class="console">$ hive
use ${env:USER}_meteo;
create table meteo (station STRING, date STRING, precipitation FLOAT, temperature FLOAT);
-- each pattern captures the n-th field; [|]? makes the separator optional,
-- so the last field keeps its final character
insert overwrite table meteo
SELECT
regexp_extract(col_value, '^(?:([^|]*)[|]?){1}', 1) station,
regexp_extract(col_value, '^(?:([^|]*)[|]?){2}', 1) date,
regexp_extract(col_value, '^(?:([^|]*)[|]?){3}', 1) precipitation,
regexp_extract(col_value, '^(?:([^|]*)[|]?){4}', 1) temperature
from temp_meteo;
</pre>
6. Run your first query
<br />
<pre class="console">$ hive --database ${USER}_meteo
SELECT station, avg(precipitation) as mean_precipitation, avg(temperature) as mean_temperature FROM meteo GROUP BY station;
</pre>
7. Woot!
<br />
8. Run a more complex query: summarize the precipitation by station and year
<br />
<pre class="console">$ hive --database ${USER}_meteo
SELECT station, dateYear, sum(precipitation) as sumPre
FROM (SELECT substring(date,1,4) as dateYear, precipitation, station FROM meteo) as T1
GROUP BY dateYear, station
ORDER BY sumPre desc;
</pre>
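<div>
To sanity-check the result of the last query, the same aggregation can be approximated on a local, header-less copy of the file with <code>awk</code>. This is only a sketch: <code>yearly_precipitation</code> is a hypothetical helper, the sample rows are made up, and the output is sorted by station and year rather than by total as in the Hive query.</div>

```shell
#!/bin/sh
# Sum precipitation per station and year from pipe-separated rows on stdin.
# Output: station|year|total, one line per group, sorted lexicographically.
yearly_precipitation() {
    awk -F'|' '
        { total[$1 "|" substr($2, 1, 4)] += $3 }
        END { for (k in total) printf "%s|%.1f\n", k, total[k] }
    ' | sort
}

# Hypothetical rows, for illustration only
printf '%s\n' 'BAS|201501|10.0|1.0' 'BAS|201502|20.5|2.0' 'BER|201501|5.0|0.5' \
    | yearly_precipitation
```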
</div>
</div>
<div lang="fr-CH" style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div lang="fr-CH" style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
</div>Vincenthttp://www.blogger.com/profile/02951964735917349036noreply@blogger.comtag:blogger.com,1999:blog-7261809550647052280.post-26966462844226575052015-03-31T14:13:00.002+02:002023-02-14T15:07:31.593+01:00HDFSHelloWorld<br />
This page is a "copy-paste"-style tutorial to get familiar with <a href="http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html">HDFS commands</a>. It mainly focuses on user commands (uploading data to and downloading data from HDFS).<br />
<h1>
Requirements</h1>
<ul>
<li>SSH (for Windows, use <a href="http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html" target="_blank">PuTTY</a> <span style="background-color: white; color: #252525; font-size: 14px; line-height: 1.5em;">and see <a href="https://www.digitalocean.com/community/tutorials/how-to-use-ssh-keys-with-putty-on-digitalocean-droplets-windows-users" target="_blank">how to create a key with PuTTY</a></span>)</li>
<li>An account on the <a href="http://daplab.ch/" target="_blank">DAPLAB</a>; send your SSH public key to Benoit.</li>
<li>A browser -- well, if you can access this page, you should have met this requirement :)</li>
</ul>
<div>
<h1>
Resources</h1>
<br />
While the source of truth for HDFS commands is the source code, the documentation page describing the <code style="background-color: #f9f9f9; border-radius: 2px; border: 1px solid rgb(221, 221, 221); color: black; font-family: monospace, Courier; padding: 1px 4px;">hdfs dfs</code> commands is really useful:</div>
<div>
<ul style="background-color: white; color: #252525; font-size: 14px; line-height: 1.5em;">
<li style="margin-bottom: 0.1em;"><a href="http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html">http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html</a></li>
</ul>
</div>
<h1>
Basic Manipulations</h1>
<h3>
Listing a folder</h3>
<h4>
Your home folder</h4>
<div>
<pre class="console">$ hdfs dfs -ls
Found 28 items
...
-rw-r--r-- 3 bperroud daplab_user 6398990 2015-03-13 11:01 data.csv
...
^^^^^^^^^^ ^ ^^^^^^^^ ^^^^^^^^^^^ ^^^^^^^ ^^^^^^^^^^ ^^^^^ ^^^^^^^^
1 2 3 4 5 6 7 8
</pre>
<div style="background-color: white; color: #252525; font-size: 14px; line-height: 22.3999996185303px; margin-bottom: 0.5em; margin-top: 0.5em;">
Columns, as numbered above, represent:</div>
<ol style="background-color: white; color: #252525; font-size: 14px; line-height: 1.5em; list-style-image: none; margin: 0.3em 0px 0px 3.2em; padding: 0px;">
<li style="margin-bottom: 0.1em;">Permissions, in <a class="external text" href="http://en.wikipedia.org/wiki/File_system_permissions#Notation_of_traditional_Unix_permissions" rel="nofollow" style="background: linear-gradient(transparent, transparent) 100% 50% no-repeat, url(data:image/svg+xml; color: #663366; padding-right: 13px; text-decoration: none;">unix-style</a> syntax</li>
<li style="margin-bottom: 0.1em;">Replication factor (RF for short), the default being 3 for a file. Directories have no replication factor and show "-" instead.</li>
<li style="margin-bottom: 0.1em;">Owner</li>
<li style="margin-bottom: 0.1em;">Group owning the file</li>
<li style="margin-bottom: 0.1em;">Size of the file, in bytes. Note that to compute the physical space used, this number should be multiplied by the RF.</li>
<li style="margin-bottom: 0.1em;">Modification date. As HDFS is mostly a <i><a class="external text" href="http://en.wikipedia.org/wiki/Write_once_read_many" rel="nofollow" style="background: linear-gradient(transparent, transparent) 100% 50% no-repeat, url(data:image/svg+xml; color: #663366; padding-right: 13px;">write-once-read-many</a></i> filesystem, this date often means creation date</li>
<li style="margin-bottom: 0.1em;">Modification time. Same as date.</li>
<li style="margin-bottom: 0.1em;">Filename, within the listed folder</li>
</ol>
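<div>
The remark on physical space can be turned into a one-liner. The sketch below takes one line of <code>hdfs dfs -ls</code> output and multiplies the size by the replication factor. <code>physical_bytes</code> is a hypothetical helper, and the input is the <code>data.csv</code> line from the listing above.</div>

```shell
#!/bin/sh
# Compute the physical space of a file from one line of `hdfs dfs -ls` output:
# column 2 is the replication factor, column 5 the size in bytes.
# Lines without a numeric replication factor (directories) print nothing.
physical_bytes() {
    printf '%s\n' "$1" | awk '$2 ~ /^[0-9]+$/ { printf "%.0f\n", $2 * $5 }'
}

physical_bytes '-rw-r--r-- 3 bperroud daplab_user 6398990 2015-03-13 11:01 data.csv'
```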
<div>
<span style="color: #252525;"><span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: 14px; line-height: 21px;"><br /></span></span></div>
</div>
<h4>
Listing the /tmp folder</h4>
<div>
<pre style="background-color: #f9f9f9; border: 1px solid rgb(221, 221, 221); font-family: monospace, Courier; font-size: 14px; line-height: 1.3em; padding: 1em;"><span style="font-family: Courier New, Courier, monospace;">$ hdfs dfs -ls /tmp</span></pre>
</div>
<div>
<span style="color: #252525;"><span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; line-height: 21px;"><br /></span></span></div>
<h3>
Uploading a file</h3>
<h4>
In /tmp</h4>
<div>
<pre class="console">$ hdfs dfs -copyFromLocal localfile.txt /tmp/</pre>
</div>
<div>
<div style="background-color: white; color: #252525; font-size: 14px; line-height: 22.3999996185303px; margin-bottom: 0.5em; margin-top: 0.5em;">
The first arguments after <code style="background-color: #f9f9f9; border-radius: 2px; border: 1px solid rgb(221, 221, 221); color: black; font-family: monospace, Courier; padding: 1px 4px;">-copyFromLocal</code> point to local files or folders, while the last argument is a file (if only one file is listed as source) or a directory in HDFS.</div>
<div style="background-color: white; color: #252525; font-size: 14px; line-height: 22.3999996185303px; margin-bottom: 0.5em; margin-top: 0.5em;">
Note: <code style="background-color: #f9f9f9; border-radius: 2px; border: 1px solid rgb(221, 221, 221); color: black; font-family: monospace, Courier; padding: 1px 4px;">hdfs dfs -put</code> does roughly the same thing, but <code style="background-color: #f9f9f9; border-radius: 2px; border: 1px solid rgb(221, 221, 221); color: black; font-family: monospace, Courier; padding: 1px 4px;">-copyFromLocal</code> is more explicit when you're uploading a local file and is thus preferred.</div>
</div>
<div>
<span style="color: #252525;"><span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: 14px; line-height: 21px;"><br /></span></span></div>
<h3>
Downloading a file</h3>
<h4>
From /tmp</h4>
<div>
<pre class="console">$ hdfs dfs -copyToLocal /tmp/remotefile.txt .</pre>
</div>
<div>
<span style="color: #252525;"><span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif; font-size: 14px; line-height: 21px;"><br /></span></span></div>
<div>
<div style="background-color: white; color: #252525; font-size: 14px; line-height: 22.3999996185303px; margin-bottom: 0.5em; margin-top: 0.5em;">
The first arguments after <code class="console">-copyToLocal</code> point to files or folders in HDFS, while the last argument is a local file (if only one file is listed as source) or a local directory.</div>
<div style="background-color: white; color: #252525; font-size: 14px; line-height: 22.3999996185303px; margin-bottom: 0.5em; margin-top: 0.5em;">
Note: <code style="background-color: #f9f9f9; border-radius: 2px; border: 1px solid rgb(221, 221, 221); color: black; font-family: monospace, Courier; padding: 1px 4px;">hdfs dfs -get</code> does roughly the same thing, but <code style="background-color: #f9f9f9; border-radius: 2px; border: 1px solid rgb(221, 221, 221); color: black; font-family: monospace, Courier; padding: 1px 4px;">-copyToLocal</code> is more explicit when you're downloading a file and is thus preferred.</div>
<div style="background-color: white; color: #252525; font-size: 14px; line-height: 22.3999996185303px; margin-bottom: 0.5em; margin-top: 0.5em;">
<br /></div>
<h3>
Creating a folder</h3>
<div style="background-color: white; color: #252525; font-size: 14px; line-height: 22.3999996185303px; margin-bottom: 0.5em; margin-top: 0.5em;">
<h4>
In your home folder</h4>
</div>
<pre class="console">$ hdfs dfs -mkdir dummy-folder</pre>
<div style="background-color: white; color: #252525; font-size: 14px; line-height: 22.3999996185303px; margin-bottom: 0.5em; margin-top: 0.5em;">
<br /></div>
<div style="background-color: white; color: #252525; font-size: 14px; line-height: 22.3999996185303px; margin-bottom: 0.5em; margin-top: 0.5em;">
<h4>
In /tmp</h4>
</div>
<pre class="console">$ hdfs dfs -mkdir /tmp/dummy-folder</pre>
<div style="background-color: white; color: #252525; font-size: 14px; line-height: 22.3999996185303px; margin-bottom: 0.5em; margin-top: 0.5em;">
<span style="line-height: 22.3999996185303px;">Note that relative paths point to your home folder, </span><code style="background-color: #f9f9f9; border-radius: 2px; border: 1px solid rgb(221, 221, 221); color: black; font-family: monospace, Courier; line-height: 22.3999996185303px; padding: 1px 4px;">/user/bperroud</code><span style="line-height: 22.3999996185303px;"> for instance.</span><br />
<span style="line-height: 22.3999996185303px;"><br /></span>
<br />
</div>
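<div>
The rule for relative paths can be mimicked with a few lines of shell. This is only an illustration of the rule, not how HDFS implements it: <code>resolve_hdfs_path</code> is a hypothetical helper, and the home folder is assumed to be <code>/user/</code> followed by the username.</div>

```shell
#!/bin/sh
# Mimic how HDFS resolves a path: absolute paths are kept as-is,
# relative ones are taken relative to /user/USERNAME.
# resolve_hdfs_path is a hypothetical helper, not an HDFS command.
resolve_hdfs_path() {
    user=${2:-$(whoami)}
    case "$1" in
        /*) printf '%s\n' "$1" ;;
        *)  printf '/user/%s/%s\n' "$user" "$1" ;;
    esac
}

resolve_hdfs_path dummy-folder bperroud        # /user/bperroud/dummy-folder
resolve_hdfs_path /tmp/dummy-folder bperroud   # /tmp/dummy-folder
```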
</div>
<div>
<span style="color: #252525; font-family: sans-serif;"><span style="font-size: 14px; line-height: 21px;"><br /></span></span></div>
Vincenthttp://www.blogger.com/profile/02951964735917349036noreply@blogger.com