
Sqoop – Export

Export:

The export tool transfers data back from HDFS to an RDBMS. For example, we can run all of our analysis in Hive and then export the generated results to an external database. The target table must already exist in the remote database, so before exporting a table from HDFS we must create a table in the target database to receive the data.

We are going to export the student6 table from Hive. First we create a table in MySQL that has the target columns in the same order, with the appropriate SQL types:

mysql> create table student6(id int, name text);

Query OK, 0 rows affected (0.01 sec)



mysql> select * from student6;

Empty set (0.01 sec)

Then we run the export command:

hdadmin@ubuntu:~/sqoop-1.4.5-cdh5.3.2$ bin/sqoop export --connect jdbc:mysql://localhost:3306/mysql --username root --password **** --table student6 --export-dir /user/hive/warehouse/student6/student5.txt -m 1



17/12/24 00:47:50 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.2

17/12/24 00:47:50 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

17/12/24 00:47:50 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.

17/12/24 00:47:50 INFO tool.CodeGenTool: Beginning code generation

…….

17/12/24 00:48:04 INFO mapreduce.Job:  map 0% reduce 0%

17/12/24 00:48:09 INFO mapreduce.Job:  map 100% reduce 0%

17/12/24 00:48:09 INFO mapreduce.Job: Job job_1514099010295_0007 completed successfully

17/12/24 00:48:10 INFO mapreduce.Job: Counters: 30

            File System Counters

                        FILE: Number of bytes read=0

                        FILE: Number of bytes written=129747

                        FILE: Number of read operations=0

                        FILE: Number of large read operations=0

                        FILE: Number of write operations=0

                        HDFS: Number of bytes read=186

                        HDFS: Number of bytes written=0

                        HDFS: Number of read operations=4

                        HDFS: Number of large read operations=0

                        HDFS: Number of write operations=0

            Job Counters

                        Launched map tasks=1

                        Data-local map tasks=1

                        Total time spent by all maps in occupied slots (ms)=3067

                        Total time spent by all reduces in occupied slots (ms)=0

                        Total time spent by all map tasks (ms)=3067

                        Total vcore-seconds taken by all map tasks=3067

                        Total megabyte-seconds taken by all map tasks=3140608

            Map-Reduce Framework

                        Map input records=4

                        Map output records=4

                        Input split bytes=144

                        Spilled Records=0

                        Failed Shuffles=0

                        Merged Map outputs=0

                        GC time elapsed (ms)=89

                        CPU time spent (ms)=530

                        Physical memory (bytes) snapshot=70418432

                        Virtual memory (bytes) snapshot=321912832

                        Total committed heap usage (bytes)=16318464

            File Input Format Counters

                        Bytes Read=0

            File Output Format Counters

                        Bytes Written=0

17/12/24 00:48:10 INFO mapreduce.ExportJobBase: Transferred 186 bytes in 16.2504 seconds (11.4459 bytes/sec)

17/12/24 00:48:10 INFO mapreduce.ExportJobBase: Exported 4 records.

Finally, we can verify that the export worked by checking MySQL:

mysql> select * from student6;

+----+----------+

| id | name     |

+----+----------+

|  2 | "shashi" |

|  4 | "sai"    |

|  5 | "kiran"  |

|  6 | "kamal"  |

+----+----------+

Note that we have to specify the delimiters used in the exported files, and they should match the delimiters used during the import process. The export also works over JDBC: Sqoop generates a Java class based on the target table, and that class parses records from the text files and inserts values of the appropriate types into the table.
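For instance, if the files under the export directory use Hive's default ^A (\001) field delimiter rather than commas, the export command must declare it explicitly or the generated parser will mis-split every record. A hedged sketch (the delimiter here is an assumption about how the files were written):

```shell
# Sketch: declare the input delimiters so the generated parser splits
# records correctly. '\001' assumes Hive's default field separator.
bin/sqoop export \
  --connect jdbc:mysql://localhost:3306/mysql \
  --username root -P \
  --table student6 \
  --export-dir /user/hive/warehouse/student6 \
  --input-fields-terminated-by '\001' \
  --input-lines-terminated-by '\n' \
  -m 1
```

If the data was imported or written with comma-separated fields, as in the student6 example above, `--input-fields-terminated-by ','` would be the matching choice.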

A MapReduce job is then launched to perform the export. The export uses batch inserts, adding multiple records to the target table in a single statement, and reads from HDFS in a separate thread for better I/O performance.

Export supports two additional modes: 'update' mode, in which Sqoop generates UPDATE statements that replace existing records in the database, and 'call' mode, in which Sqoop makes a stored procedure call for each record. In the export command above we used --export-dir, the HDFS directory that contains the source data, together with --table; alternatively, --call can be used in place of --table.
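A sketch of call mode, assuming a hypothetical stored procedure insert_student taking (id, name) already exists in MySQL; --call replaces --table and Sqoop invokes the procedure once per input record:

```shell
# Sketch: export in 'call' mode. insert_student is a hypothetical
# stored procedure; Sqoop calls it for each record instead of
# generating INSERT statements against a table.
bin/sqoop export \
  --connect jdbc:mysql://localhost:3306/mysql \
  --username root -P \
  --call insert_student \
  --export-dir /user/hive/warehouse/student6 \
  -m 1
```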

As with imports, we can also export only some of the columns by listing them, separated by commas:

--columns "col1,col2,col3"

The columns that are not included in the --columns list must either have defined default values or allow NULL values; otherwise the export fails with an error. We can control the number of map tasks independently of the number of files inside the directory by using the --num-mappers or -m argument.
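Combining both options, a hedged sketch of a partial-column export with an explicit mapper count (the column list and mapper count here are illustrative):

```shell
# Sketch: export only the id and name columns, in that order, using
# 2 map tasks regardless of how many files sit in the export directory.
# Any column left out of --columns must have a default or allow NULL.
bin/sqoop export \
  --connect jdbc:mysql://localhost:3306/mysql \
  --username root -P \
  --table student6 \
  --columns "id,name" \
  --export-dir /user/hive/warehouse/student6 \
  --num-mappers 2
```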

Since Sqoop breaks the export down into multiple transactions, an export job can fail partway through, leaving partial data in the database; this can in turn cause further errors. To avoid this we can use the --staging-table option, which names an auxiliary table used to stage the exported data.

The staged data is then moved into the destination table in a single transaction. To use this facility, we first need to create the staging table before exporting; it must have the same structure as the target table and must be empty.

The disadvantage of a staging table is that it is slower, since the data must be written twice: first to the staging table and then to the destination table. The export also uses more space while it is running, because two copies of the data exist while the staged data is being copied to the destination.
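Putting this together, a hedged sketch of a staged export; student6_stage is a hypothetical staging table created beforehand with the same schema as student6, and --clear-staging-table empties it before the job runs:

```shell
# Sketch: stage the export, then move all rows into student6 in a
# single transaction. Create the staging table first, e.g.:
#   mysql> create table student6_stage(id int, name text);
bin/sqoop export \
  --connect jdbc:mysql://localhost:3306/mysql \
  --username root -P \
  --table student6 \
  --staging-table student6_stage \
  --clear-staging-table \
  --export-dir /user/hive/warehouse/student6 \
  -m 1
```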

Now let us look at inserts and updates in Sqoop. By default, sqoop-export performs inserts: it appends new rows to the table, and each input record must not violate the table's primary key constraint. When we use the --update-key argument, Sqoop instead modifies existing rows, matching them on the given column name(s).

CREATE TABLE test(

    id INT NOT NULL PRIMARY KEY,

    marks int );

Consider also a dataset in HDFS containing records like these:

0,20

1,50

...

Running sqoop export --table test --update-key id --export-dir /path/to/data --connect … will run an export job that executes UPDATE statements based on the data, like so:

UPDATE test SET marks=70 WHERE id=0;

If an UPDATE statement does not modify any rows, this is not considered an error. Similarly, if the value in the --update-key column does not identify any row, that is not an error either. We can also combine update mode with 'allowinsert' mode if we want to update rows where they already exist in the database and insert them where they do not.
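An upsert-style export along those lines can be sketched as follows (the connection details are placeholders, as above):

```shell
# Sketch: rows whose id already exists in 'test' are updated;
# ids not yet present are inserted. Whether allowinsert is supported
# depends on the database connector in use.
bin/sqoop export \
  --connect jdbc:mysql://localhost:3306/mysql \
  --username root -P \
  --table test \
  --update-key id \
  --update-mode allowinsert \
  --export-dir /path/to/data
```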
