Important Sqoop Commands:
Validation
Validation is used to verify the data during an import or export by comparing the row counts at the source and the destination. For example:
A basic import of a table named EMPLOYEES in the test database that uses validation to check the row counts:
$ sqoop import --connect jdbc:mysql://db.foo.com/test \
    --table EMPLOYEES --validate
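Validation works for exports as well. A minimal sketch, assuming a hypothetical table bar that was populated from the HDFS directory /results/bar_data (both names are example values, not from the original):

```shell
# Export HDFS data back into a database table, then compare row counts
# between the HDFS directory and the target table.
sqoop export --connect jdbc:mysql://db.foo.com/test \
    --table bar --export-dir /results/bar_data --validate
```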
Sqoop-job
We can save jobs in Sqoop if we want to run the same import or export commands repeatedly, which makes the task easier. A saved job records all the configuration information needed to execute a Sqoop command later.
By default these job descriptions are saved in $HOME/.sqoop/. We can also use a shared metastore, which makes the saved jobs available to many users in a shared cluster.
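To point job commands at a shared metastore instead of the local $HOME/.sqoop/ store, the job tool accepts a --meta-connect argument. A sketch, assuming a metastore running on a hypothetical host metastore.example.com on Sqoop's default metastore port 16000:

```shell
# Create the saved job in a shared metastore so other users on the
# cluster can list and run it. The hostname is an example value.
sqoop job --create myjob \
    --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop \
    -- import --connect jdbc:mysql://example.com/db --table mytable
```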
$ sqoop job --create myjob -- import --connect jdbc:mysql://example.com/db \
    --table mytable
This creates a job named myjob which can be executed later; creating the job does not run it. The job now appears in the list of saved jobs:
$ sqoop job --list
Available jobs:
  myjob
We can inspect the configuration of a job with the show action:
$ sqoop job --show myjob
Job: myjob
Tool: import
Options:
----------------------------
direct.import = false
codegen.input.delimiters.record = 0
hdfs.append.dir = false
db.table = mytable
And if we are satisfied with it, we can run the job with exec:
$ sqoop job --exec myjob
10/08/19 13:08:45 INFO tool.CodeGenTool: Beginning code generation
The exec action allows you to override arguments of the saved job by supplying them after a --. For example, if the database were changed to require a username, we could specify the username and password with:
$ sqoop job --exec myjob -- --username someuser -P
Enter password:
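A saved job that is no longer needed can be removed with the job tool's --delete action:

```shell
# Remove the saved job definition from the job store.
sqoop job --delete myjob
```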
Sqoop-merge
The merge tool is used to combine two datasets, where values from the newer dataset overwrite those in the older one. Normally we use it in combination with incremental imports: after running incremental imports in lastmodified mode we end up with multiple datasets in HDFS, and merge flattens them into one dataset, keeping only the newest record for each primary key.
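For context, an incremental import in lastmodified mode looks like the following sketch; the check column last_update and the last-value timestamp are example values, not from the original:

```shell
# Import only rows whose last_update timestamp is newer than --last-value.
sqoop import --connect jdbc:mysql://example.com/db --table mytable \
    --incremental lastmodified --check-column last_update \
    --last-value "2010-08-19 13:08:45"
```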
The merge tool also runs a MapReduce job that takes two directories as input: the new dataset given by the --new-data argument and the old dataset given by the --onto argument. The output is stored in the HDFS directory given by the --target-dir argument.
This merging happens based on a unique primary key in each record, where the key column is specified with --merge-key. Multiple rows in the same dataset should not have the same primary key, otherwise data loss may occur.
We can use the codegen tool to generate the class that extracts the key column, and we can specify the class name and jar file with the --class-name and --jar-file parameters.
Let us take an example to explain the merge tool. Here we merge two incremental imports with --merge-key set to id, where the older data is in an HDFS directory named older and the newer data is in an HDFS directory named newer:
$ sqoop merge --new-data newer --onto older --target-dir merged \
    --jar-file datatypes.jar --class-name Foo --merge-key id
This will run a MapReduce job where the value of the id column of each row is used to join rows; rows in the newer dataset will be used in preference to rows in the older dataset.
This can be used with SequenceFile-, Avro-, and text-based incremental imports. The file types of the newer and older datasets must be the same.
Sqoop-codegen
The codegen tool is used to regenerate the Java classes Sqoop creates during an import, which is useful if the Java source has been lost or some other problem occurs. It can also create a new version of a class, for example with different delimiters between fields.
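To regenerate a class with different field delimiters, codegen accepts the same output-line formatting arguments as import; a sketch using --fields-terminated-by to produce tab-separated fields:

```shell
# Regenerate the employees record class with tab as the field delimiter.
sqoop codegen --connect jdbc:mysql://db.example.com/corp \
    --table employees --fields-terminated-by '\t'
```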
Recreate the record interpretation code for the employees table of a corporate database:
$ sqoop codegen --connect jdbc:mysql://db.example.com/corp \
    --table employees
Sqoop-eval
The eval tool in Sqoop is used to run simple queries against the database and print the results to the console. It is intended for testing purposes only, so avoid it in a production environment. We can use it, for example, to check that the data we are about to import looks correct.
Select ten records from the employees table:
$ sqoop eval --connect jdbc:mysql://db.example.com/corp \
    --query "SELECT * FROM employees LIMIT 10"
Insert a row into the foo table:
$ sqoop eval --connect jdbc:mysql://db.example.com/corp \
    -e "INSERT INTO foo VALUES(42, 'bar')"
Sqoop-list-databases
It is used to list the database schemas present on a server. This tool works only with MySQL and Oracle, and with Oracle we need DBA privileges.
List database schemas available on a MySQL server:
$ sqoop list-databases --connect jdbc:mysql://database.example.com/
information_schema
employees
Sqoop-list-tables
It is used to list tables present in the database.
List tables available in the “corp” database:
$ sqoop list-tables --connect jdbc:mysql://database.example.com/corp
employees
payroll_checks
job_descriptions
office_supplies