Sqoop – import

Sqoop uses the import tool to import a table from an RDBMS into HDFS. Each row of the imported table is represented as a separate record in HDFS. First of all, to use this feature we need a database in MySQL. On a CentOS/RHEL Linux environment we can install MySQL with sudo yum install mysql mysql-server. Next, let us create a sample database in it.

On Ubuntu we can install the MySQL server as:

$sudo apt-get update

$sudo apt-get install mysql-server

During installation, set a password for the root user, and then log in to the MySQL console as:

mysql -p -u root -h localhost

Now let us create a sample table inside MySQL's default mysql database.

hdadmin@ubuntu:~/sqoop-1.4.5-cdh5.3.2$ mysql -p -u root -h localhost

Enter password:

Welcome to the MySQL monitor.  Commands end with ; or \g.

Your MySQL connection id is 44

Server version: 5.5.58-0ubuntu0.14.04.1 (Ubuntu)



Copyright (c) 2000, 2017, Oracle and/or its affiliates. All rights reserved.



Oracle is a registered trademark of Oracle Corporation and/or its

affiliates. Other names may be trademarks of their respective

owners.



Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

Here, by default, we have 3 databases:

mysql> show databases;

+--------------------+

| Database           |

+--------------------+

| information_schema |

| mysql              |

| performance_schema |

+--------------------+

3 rows in set (0.00 sec)

We will use the mysql database, create an employee table, and insert some values into it.

mysql> use mysql

Reading table information for completion of table and column names

You can turn off this feature to get a quicker startup with -A

Database changed

mysql> show tables;

+---------------------------+

| Tables_in_mysql           |

+---------------------------+

| columns_priv              |

| db                        |

| emp                       |

| event                     |

| func                      |

| general_log               |

| help_category             |

| help_keyword              |

| help_relation             |

| help_topic                |

| host                      |

| ndb_binlog_index          |

| plugin                    |

| proc                      |

| procs_priv                |

| proxies_priv              |

| servers                   |

| slow_log                  |

| tables_priv               |

| time_zone                 |

| time_zone_leap_second     |

| time_zone_name            |

| time_zone_transition      |

| time_zone_transition_type |

| user                      |

+---------------------------+

26 rows in set (0.00 sec)

mysql> create table employee (id int, name text);



mysql> insert into employee values (1, 'sai');

mysql> insert into employee values (2, 'kishore');

mysql> insert into employee values (3, 'chandu');

mysql> insert into employee values (4, 'gopal');



mysql> select * from employee;

+------+---------+

| id   | name    |

+------+---------+

|    1 | sai     |

|    2 | kishore |

|    3 | chandu  |

|    4 | gopal   |

+------+---------+

4 rows in set (0.00 sec)

Before going any further, you need to download the JDBC driver JAR file for MySQL (Connector/J) and add it to Sqoop’s classpath, which is simply achieved by placing it in Sqoop’s lib directory.
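For example, the JAR can be copied straight into that directory (a sketch; the Connector/J version and the paths here are placeholders, so adjust them to your own setup):

$ cp mysql-connector-java-5.1.38-bin.jar /home/hdadmin/sqoop-1.4.5-cdh5.3.2/lib/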

Now let’s use Sqoop to import this table into HDFS:

hdadmin@ubuntu:~/sqoop-1.4.5-cdh5.3.2$ bin/sqoop import --connect jdbc:mysql://localhost:3306/mysql --username root --password **** --table employee --target-dir /sqoop01 -m 1



17/12/22 22:10:43 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.2

17/12/22 22:10:44 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

17/12/22 22:10:44 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.

17/12/22 22:10:44 INFO tool.CodeGenTool: Beginning code generation

17/12/22 22:10:44 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee` AS t LIMIT 1

17/12/22 22:10:44 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee` AS t LIMIT 1

17/12/22 22:10:44 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/hdadmin/hadoop-2.5.0-cdh5.3.2

Note: /tmp/sqoop-hdadmin/compile/5fee2b1f026904af06f9b22a1d257939/employee.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

17/12/22 22:10:46 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdadmin/compile/5fee2b1f026904af06f9b22a1d257939/employee.jar

17/12/22 22:10:46 WARN manager.MySQLManager: It looks like you are importing from mysql.

17/12/22 22:10:46 WARN manager.MySQLManager: This transfer can be faster! Use the --direct

17/12/22 22:10:46 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.

17/12/22 22:10:46 INFO mapreduce.ImportJobBase: Beginning import of employee

17/12/22 22:10:46 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

17/12/22 22:10:46 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar

17/12/22 22:10:47 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps

17/12/22 22:10:47 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032

17/12/22 22:10:51 INFO db.DBInputFormat: Using read commited transaction isolation

17/12/22 22:10:52 INFO mapreduce.JobSubmitter: number of splits:1

17/12/22 22:10:52 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1514005884865_0002

17/12/22 22:10:53 INFO impl.YarnClientImpl: Submitted application application_1514005884865_0002

17/12/22 22:10:53 INFO mapreduce.Job: The url to track the job: http://ubuntu:8088/proxy/application_1514005884865_0002/

17/12/22 22:10:53 INFO mapreduce.Job: Running job: job_1514005884865_0002

17/12/22 22:11:06 INFO mapreduce.Job: Job job_1514005884865_0002 running in uber mode : false

17/12/22 22:11:06 INFO mapreduce.Job:  map 0% reduce 0%

17/12/22 22:11:13 INFO mapreduce.Job:  map 100% reduce 0%

17/12/22 22:11:13 INFO mapreduce.Job: Job job_1514005884865_0002 completed successfully

17/12/22 22:11:13 INFO mapreduce.Job: Counters: 30

            File System Counters

                        FILE: Number of bytes read=0

                        FILE: Number of bytes written=129813

                        FILE: Number of read operations=0

                        FILE: Number of large read operations=0

                        FILE: Number of write operations=0

                        HDFS: Number of bytes read=87

                        HDFS: Number of bytes written=33

                        HDFS: Number of read operations=4

                        HDFS: Number of large read operations=0

                        HDFS: Number of write operations=2

            Job Counters

                        Launched map tasks=1

                        Other local map tasks=1

                        Total time spent by all maps in occupied slots (ms)=3646

                        Total time spent by all reduces in occupied slots (ms)=0

                        Total time spent by all map tasks (ms)=3646

                        Total vcore-seconds taken by all map tasks=3646

                        Total megabyte-seconds taken by all map tasks=3733504

            Map-Reduce Framework

                        Map input records=4

                        Map output records=4

                        Input split bytes=87

                        Spilled Records=0

                        Failed Shuffles=0

                        Merged Map outputs=0

                        GC time elapsed (ms)=125

                        CPU time spent (ms)=630

                        Physical memory (bytes) snapshot=72994816

                        Virtual memory (bytes) snapshot=323624960

                        Total committed heap usage (bytes)=16318464

            File Input Format Counters

                        Bytes Read=0

            File Output Format Counters

                        Bytes Written=33

17/12/22 22:11:13 INFO mapreduce.ImportJobBase: Transferred 33 bytes in 25.6382 seconds (1.2871 bytes/sec)

17/12/22 22:11:13 INFO mapreduce.ImportJobBase: Retrieved 4 records.

Here the connect string describes how to connect to the database: it specifies the server, the database name, and the port number. If we are running on a distributed Hadoop cluster, we must give the full hostname or IP address of the database server in place of localhost, because the connect string we supply is used by the map tasks running on the cluster's nodes; if we pass localhost, each node would try to connect to a database on itself (or to a different database), which is a problem. So always use the hostname or IP address instead of localhost.
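For example, on a real cluster the same import would look like this (a sketch; mysql-host.example.com stands in for your actual database server's hostname):

$ bin/sqoop import --connect jdbc:mysql://mysql-host.example.com:3306/mysql --username root -P --table employee --target-dir /sqoop01 -m 1

Here -P makes Sqoop prompt for the password instead of taking it on the command line, which the warning in the log above also recommends.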

So here the import tool runs a MapReduce job that connects to the MySQL database and reads data from the table. We specified '-m 1', which means only one map task is used, so a single file is created in the HDFS /sqoop01 directory. By default, Sqoop uses 4 map tasks that run in parallel to speed up the import. Also by default, Sqoop generates comma-delimited text files for the imported data; we can specify delimiters explicitly as well.
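To run the import with the default four parallel map tasks, Sqoop needs a column to split the work on, since our employee table has no primary key. A sketch using the id column (the /sqoop02 target directory is arbitrary):

$ bin/sqoop import --connect jdbc:mysql://localhost:3306/mysql --username root -P --table employee --target-dir /sqoop02 --split-by id -m 4

This produces up to four part-m-0000x files under /sqoop02, one per map task.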

We can see the file’s contents as:

hdadmin@ubuntu:~/sqoop-1.4.5-cdh5.3.2$ hdfs dfs -ls /

Found 10 items

drwxr-xr-x   - hdadmin supergroup          0 2017-12-09 07:02 /data

drwxr-xr-x   - hdadmin supergroup          0 2017-12-11 00:20 /data1

drwxr-xr-x   - hdadmin supergroup          0 2017-12-09 22:35 /data2

drwxr-xr-x   - hdadmin supergroup          0 2017-12-09 22:56 /data3

drwxr-xr-x   - hdadmin supergroup          0 2017-12-13 05:05 /flume-00001

drwxr-xr-x   - hdadmin supergroup          0 2017-12-13 05:35 /flume-00002

drwxr-xr-x   - hdadmin supergroup          0 2017-12-11 00:01 /home

drwxr-xr-x   - hdadmin supergroup          0 2017-12-22 22:11 /sqoop01

drwxr-xr-x   - hdadmin supergroup          0 2017-12-10 00:06 /tmp

drwxr-xr-x   - hdadmin supergroup          0 2017-12-09 06:25 /user



hdadmin@ubuntu:~/sqoop-1.4.5-cdh5.3.2$ hdfs dfs -ls /sqoop01

Found 2 items

-rw-r--r--   1 hdadmin supergroup          0 2017-12-22 22:11 /sqoop01/_SUCCESS

-rw-r--r--   1 hdadmin supergroup         33 2017-12-22 22:11 /sqoop01/part-m-00000



hdadmin@ubuntu:~/sqoop-1.4.5-cdh5.3.2$ hdfs dfs -cat /sqoop01/part-m-00000



1,sai

2,kishore

3,chandu

4,gopal

The basic output format is a text file, but text files cannot hold binary fields (such as VARBINARY columns) and cannot distinguish between a null value and the string "null", so for such data we normally use SequenceFiles, Avro datafiles, or Parquet files. Avro and Parquet can be processed by many languages, while SequenceFiles store individual records in custom record-specific (Java) data types. We can also compress our data using the '--compress' argument.
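For example, the same table could be imported as a compressed Avro datafile (a sketch; /sqoop03 is an arbitrary target directory):

$ bin/sqoop import --connect jdbc:mysql://localhost:3306/mysql --username root -P --table employee --target-dir /sqoop03 --as-avrodatafile --compress -m 1

Similarly, --as-sequencefile selects SequenceFile output, and recent Sqoop versions also support --as-parquetfile.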

Some of the delimiter arguments we can use are:

--fields-terminated-by <char> : sets the field separator

--lines-terminated-by <char> : sets the end-of-record character

--enclosed-by <char> : sets a required field-enclosing character

--optionally-enclosed-by <char> : sets an optional field-enclosing character

--escaped-by <char> : sets the escape character

--mysql-delimiters : uses MySQL's default delimiter set

The default delimiters are a comma (,) for fields and a newline (\n) for records. If the imported records themselves contain these delimiter characters, the output becomes ambiguous, so we should enclose or escape the fields (for example with --enclosed-by and --escaped-by), or switch to MySQL's default delimiter set with the '--mysql-delimiters' argument.
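For example, to produce tab-separated output instead of the default commas (a sketch; /sqoop04 is an arbitrary target directory):

$ bin/sqoop import --connect jdbc:mysql://localhost:3306/mysql --username root -P --table employee --target-dir /sqoop04 --fields-terminated-by '\t' -m 1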

