Important and useful HDFS commands

To check the Hadoop version

$ hadoop version

Creating a user home directory

hadoop fs -mkdir -p /user/dpq  (-p creates the directory if it is not present; if the directory already exists it won't throw an exception)

hadoop fs -mkdir /user/retails (without -p, if the directory already exists it will throw an exception)
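-p also creates any missing parent directories in one step, so a nested path can be made with a single command (the path below is illustrative):

hadoop fs -mkdir -p /user/dpq/projects/input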

List all directories

hdfs dfs -ls /

hdfs dfs -ls /user

Copy file from local to HDFS

hadoop fs -copyFromLocal /etc/data/retail_data.csv /user/retails

hadoop fs -put /etc/data/retail_data.csv /user/retails (if 'retail_data.csv' is already present inside the '/user/retails' HDFS directory, it will throw a 'File already present' exception)

hadoop fs -put -f /etc/data/retail_data.csv /user/retails (-f will forcefully overwrite retail_data.csv if it is already present)
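After the copy, listing the target directory confirms the file is there:

hadoop fs -ls /user/retails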

Checking data in HDFS

hdfs dfs -cat /user/retails/retail_data.csv
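For large files, -cat prints the entire contents; -tail prints just the last kilobyte:

hdfs dfs -tail /user/retails/retail_data.csv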

Create empty file on HDFS

hdfs dfs -touchz /user/retails/empty.csv

Copy a file from HDFS to local

hdfs dfs -get /user/retails/retail_data.csv

hdfs dfs -get /user/retails (copies the whole directory into the current local directory)
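A related command, -getmerge, concatenates all the files under an HDFS directory into a single local file (the local path below is illustrative):

hdfs dfs -getmerge /user/retails /tmp/retails_merged.csv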

Copy files in HDFS from one location to another

hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>

hdfs dfs -cp /user/retails/empty.csv /user/dpq/

To rename a file or move it to another location

hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
hdfs dfs -mv /user/retails/empty.csv /user/dpq/emp.csv (changing the file name)
hdfs dfs -mv /user/dpq/ /user/stocks/ (changing the HDFS location; emp.csv will now be present under the stocks directory)

To remove file or directory

hdfs dfs -rmr /user/stocks/emp.csv (removes the emp.csv file)

hdfs dfs -rmr /user/stocks (removes the stocks directory and everything under it)
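Note: -rmr is deprecated in recent Hadoop releases; the modern equivalents are -rm for files and -rm -r for directories, with -skipTrash to delete immediately instead of moving the path to trash:

hdfs dfs -rm /user/stocks/emp.csv
hdfs dfs -rm -r -skipTrash /user/stocks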

du: It gives the size of each file in the directory

hdfs dfs -du /geeks

dus: This command gives the total size of the directory/file

hdfs dfs -dus /geeks
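Note: -dus is deprecated; -du -s is the modern equivalent, and -h prints sizes in human-readable units:

hdfs dfs -du -s -h /geeks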

stat: It gives the last modified time of a directory or file. In short, it gives the stats of the directory or file.

hdfs dfs -stat /geeks
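stat also accepts a format string, e.g. %y for modification time, %b for size in bytes, %r for replication factor, and %n for the name:

hdfs dfs -stat "%y %b %r %n" /geeks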

setrep: This command is used to change the replication factor of a file/directory in HDFS. By default it is 3 for anything stored in HDFS (as set by dfs.replication in hdfs-site.xml).

Example 1: To change the replication factor to 6 for geeks.txt stored in HDFS.

hdfs dfs -setrep -R -w 6 geeks.txt

Example 2: To change the replication factor to 4 for the directory /geeks stored in HDFS.

hdfs dfs -setrep -R  4 /geeks

Note: The -w flag means wait till the replication is completed, and -R means recursively; we use it for directories as they may contain many files and folders inside them.
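The replication status of a path can also be inspected with fsck:

hdfs fsck /geeks -files -blocks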

Note: There are more commands in HDFS, but we discussed the ones commonly used when working with Hadoop. You can check out the full list of dfs commands using the following command:

hdfs dfs
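Usage details for any single command are available with -help, for example:

hdfs dfs -help setrep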

A few more commands:

hdfs dfs -ls /data/dpq/dummy/
hdfs dfs -mkdir -p /data/dpq/dummy/temp
hdfs dfs -put /data/sparkSetup/files/*.* /data/dpq/dummy/temp/
hdfs dfs -ls /data/dpq/dummy/temp/ (default replication will be 3)
hdfs dfs -rm /data/dpq/dummy/temp/GDGA111_1_20180918111340752.csv (moves the file to trash rather than deleting it permanently)

To change the replication factor while storing data to HDFS, use the command below:
hdfs dfs -Ddfs.replication=1 -put /data/dpq/dummy/temp/GDGA111_1_20180918111340752.csv /data/dpq/temp
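In the -ls output, the second column is the replication factor, so the setting above can be verified with:

hdfs dfs -ls /data/dpq/temp/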

hdfs dfs -get /data/dpq/temp/GDGA111_1_20180918111340752.csv
hdfs dfs -cat /data/dpq/temp/GDGA111_1_20180918111340752.ctrl

hdfs dfs -du -h /data/dpq/temp/ (size including replication can be checked here)

hdfs dfs -chmod 777 /data/dpq/temp/GDGA111_1_20180918111340752.csv
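Ownership can be changed in the same way with -chown (the dpq:hadoop user/group pair below is illustrative):

hdfs dfs -chown dpq:hadoop /data/dpq/temp/GDGA111_1_20180918111340752.csv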

hdfs dfs -mkdir -p /data/dpq/model_temp
hdfs dfs -ls /data/dpq/
hdfs dfs -chmod 777 /data/dpq/model_temp
hdfs dfs -ls /data/dpq/

hdfs dfs -touchz /data/dpq/model_temp/1.txt (creates an empty file)
