
Other features of Import:

We can also import a subset of columns from a table, or import particular rows by using a 'where' clause. Instead of selecting columns and rows separately, we can also supply a free-form query for the import. For example:

--columns "name,employee_id,jobtitle"

SELECT <column list> FROM <table name> WHERE id > 45
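Combining both options in a single command might look like this (the database, table, and column names here are illustrative, not from a real system):

```shell
# Import only three columns, and only the rows matching the --where filter.
$ sqoop import \
  --connect jdbc:mysql://localhost:3306/corp \
  --username root -P \
  --table employees \
  --columns "name,employee_id,jobtitle" \
  --where "id > 45" \
  --target-dir /user/foo/employees_subset
```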

$ sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' \
  --split-by a.id \
  --target-dir /user/foo/joinresults

So here we are importing with a free-form query and a split column. During execution the split column is used to partition the workload across mappers. By default, the primary key column is used as the split column.

For example, if the table has 100 primary key values and 2 mappers running, the split column divides those 100 values into 2 ranges of roughly 50 values each. Sqoop computes these ranges from the minimum and maximum values of the split column, which it retrieves automatically.
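Under the hood, Sqoop first runs a bounding query and then assigns each mapper one range of the split column. A rough sketch of this (illustrative only; the exact SQL Sqoop generates may differ):

```shell
# 1) Bounding query to find the split range:
#      SELECT MIN(id), MAX(id) FROM employees;   -- say it returns 1 and 100
# 2) With 2 mappers, each one imports a half-open range of the split column:
#      mapper 1:  SELECT * FROM employees WHERE id >= 1  AND id < 51;
#      mapper 2:  SELECT * FROM employees WHERE id >= 51 AND id <= 100;
```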

This parallelism makes the import faster and we get results sooner. If the primary key values are not uniformly distributed, we can choose our own split column (a single column, not multiple columns) with the --split-by argument.

If the table has no primary key and no --split-by column is given, the import fails unless we set the number of mappers to 1. A single mapper produces one output file in HDFS.
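A minimal sketch of such a single-mapper import (connection details and table name are illustrative):

```shell
# The table has no primary key, so force a single mapper with -m 1.
$ sqoop import \
  --connect jdbc:mysql://localhost:3306/corp \
  --username root -P \
  --table audit_log \
  -m 1 \
  --target-dir /user/foo/audit_log
```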

By default Sqoop uses the JDBC connector, but for faster execution we can use database-specific tools instead. Separate tools exist for MySQL, Oracle, etc. For MySQL, Sqoop uses the 'mysqldump' tool, which can export data out of MySQL very quickly.

We have to pass the --direct argument to use these database-specific tools. Note that direct-mode import for MySQL cannot handle large objects such as BLOB and CLOB columns.

Sqoop can perform direct-mode imports for PostgreSQL, Oracle, and Netezza. Even when direct mode is used to access the contents of a database, the metadata is still queried through JDBC.
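A direct-mode MySQL import might look like the following (connection details and table name are illustrative):

```shell
# --direct makes Sqoop use mysqldump for the data transfer instead of plain JDBC;
# the table metadata is still fetched over JDBC.
$ sqoop import \
  --connect jdbc:mysql://localhost:3306/corp \
  --username root -P \
  --table employees \
  --direct \
  --target-dir /user/foo/employees_direct
```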

Import the data into HBase:

We can also import data into HBase by specifying the --hbase-table argument. Each row of the source table is transformed into a row of the target table using an HBase put operation. We specify the column whose value becomes the HBase row key with --hbase-row-key, and the column family with the --column-family argument. The target table and column family must exist before the import runs, otherwise the Sqoop job fails; alternatively, we can let Sqoop create the table with --hbase-create-table.
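An HBase import using these arguments might look like this (the HBase table, column family, and connection details are illustrative):

```shell
# Import into the HBase table 'emp', column family 'cf',
# using employee_id as the HBase row key.
$ sqoop import \
  --connect jdbc:mysql://localhost:3306/corp \
  --username root -P \
  --table employees \
  --hbase-table emp \
  --column-family cf \
  --hbase-row-key employee_id \
  --hbase-create-table
```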

Import all tables:

This option imports all tables from the RDBMS into HDFS. All columns of every table are imported, and no 'where' clause can be applied. The data from each table is stored in a separate directory in HDFS.

Import all tables from the MySQL database:

$ bin/sqoop import-all-tables --connect jdbc:mysql://localhost:3306/mysql --username root --password **** --warehouse-dir /sqoop9 -m 1