Sqoop – import:
Sqoop uses the import tool to import a table from an RDBMS into HDFS. Each row of the imported table is represented as a separate record in HDFS. Before we can use this feature, we need a database in MySQL. On a Red Hat-based Linux system we can install it with sudo yum install mysql mysql-server. Next, let us create a sample database in it.
On Ubuntu we can install the MySQL server as follows:
$ sudo apt-get update
$ sudo apt-get install mysql-server
During installation, set a password for the root user, then log in to the MySQL console:
mysql -p -u root -h localhost
Now let us create a sample table in the mysql database.
hdadmin@ubuntu:~/sqoop-1.4.5-cdh5.3.2$ mysql -p -u root -h localhost
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 44
Server version: 5.5.58-0ubuntu0.14.04.1 (Ubuntu)

Copyright (c) 2000, 2017, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
Here, by default, we have 3 databases.
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
+--------------------+
3 rows in set (0.00 sec)
We will use the mysql database, create an employee table, and insert some rows into it.
mysql> use mysql
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> show tables;
+---------------------------+
| Tables_in_mysql           |
+---------------------------+
| columns_priv              |
| db                        |
| emp                       |
| event                     |
| func                      |
| general_log               |
| help_category             |
| help_keyword              |
| help_relation             |
| help_topic                |
| host                      |
| ndb_binlog_index          |
| plugin                    |
| proc                      |
| procs_priv                |
| proxies_priv              |
| servers                   |
| slow_log                  |
| tables_priv               |
| time_zone                 |
| time_zone_leap_second     |
| time_zone_name            |
| time_zone_transition      |
| time_zone_transition_type |
| user                      |
+---------------------------+
26 rows in set (0.00 sec)
mysql> create table employee (id int, name text);
mysql> insert into employee values (1, 'sai');
mysql> insert into employee values (2, 'kishore');
mysql> insert into employee values (3, 'chandu');
mysql> insert into employee values (4, 'gopal');
mysql> select * from employee;
+------+---------+
| id   | name    |
+------+---------+
|    1 | sai     |
|    2 | kishore |
|    3 | chandu  |
|    4 | gopal   |
+------+---------+
4 rows in set (0.00 sec)
Before going any further, you need to download the JDBC driver JAR file for MySQL (Connector/J) and add it to Sqoop’s classpath, which is simply achieved by placing it in Sqoop’s lib directory.
Now let’s use Sqoop to import this table into HDFS:
hdadmin@ubuntu:~/sqoop-1.4.5-cdh5.3.2$ bin/sqoop import --connect jdbc:mysql://localhost:3306/mysql --username root --password **** --table employee --target-dir /sqoop01 -m 1
17/12/22 22:10:43 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.2
17/12/22 22:10:44 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
17/12/22 22:10:44 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
17/12/22 22:10:44 INFO tool.CodeGenTool: Beginning code generation
17/12/22 22:10:44 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee` AS t LIMIT 1
17/12/22 22:10:44 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee` AS t LIMIT 1
17/12/22 22:10:44 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/hdadmin/hadoop-2.5.0-cdh5.3.2
Note: /tmp/sqoop-hdadmin/compile/5fee2b1f026904af06f9b22a1d257939/employee.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
17/12/22 22:10:46 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdadmin/compile/5fee2b1f026904af06f9b22a1d257939/employee.jar
17/12/22 22:10:46 WARN manager.MySQLManager: It looks like you are importing from mysql.
17/12/22 22:10:46 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
17/12/22 22:10:46 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
17/12/22 22:10:46 INFO mapreduce.ImportJobBase: Beginning import of employee
17/12/22 22:10:46 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/12/22 22:10:46 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
17/12/22 22:10:47 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
17/12/22 22:10:47 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/12/22 22:10:51 INFO db.DBInputFormat: Using read commited transaction isolation
17/12/22 22:10:52 INFO mapreduce.JobSubmitter: number of splits:1
17/12/22 22:10:52 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1514005884865_0002
17/12/22 22:10:53 INFO impl.YarnClientImpl: Submitted application application_1514005884865_0002
17/12/22 22:10:53 INFO mapreduce.Job: The url to track the job: http://ubuntu:8088/proxy/application_1514005884865_0002/
17/12/22 22:10:53 INFO mapreduce.Job: Running job: job_1514005884865_0002
17/12/22 22:11:06 INFO mapreduce.Job: Job job_1514005884865_0002 running in uber mode : false
17/12/22 22:11:06 INFO mapreduce.Job:  map 0% reduce 0%
17/12/22 22:11:13 INFO mapreduce.Job:  map 100% reduce 0%
17/12/22 22:11:13 INFO mapreduce.Job: Job job_1514005884865_0002 completed successfully
17/12/22 22:11:13 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=129813
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=87
		HDFS: Number of bytes written=33
		HDFS: Number of read operations=4
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Other local map tasks=1
		Total time spent by all maps in occupied slots (ms)=3646
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=3646
		Total vcore-seconds taken by all map tasks=3646
		Total megabyte-seconds taken by all map tasks=3733504
	Map-Reduce Framework
		Map input records=4
		Map output records=4
		Input split bytes=87
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=125
		CPU time spent (ms)=630
		Physical memory (bytes) snapshot=72994816
		Virtual memory (bytes) snapshot=323624960
		Total committed heap usage (bytes)=16318464
	File Input Format Counters
		Bytes Read=0
	File Output Format Counters
		Bytes Written=33
17/12/22 22:11:13 INFO mapreduce.ImportJobBase: Transferred 33 bytes in 25.6382 seconds (1.2871 bytes/sec)
17/12/22 22:11:13 INFO mapreduce.ImportJobBase: Retrieved 4 records.
Here the connect string describes how to connect to the database: it contains the server, the database name, and the port number. In a distributed Hadoop cluster we must give the full hostname or IP address instead of localhost, because the connect string we supply is used by the map tasks running on the cluster's worker nodes; if we give localhost, each node will try to connect to a MySQL instance on itself, which will fail or read the wrong data. So always use the hostname or IP address in place of localhost.
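As a quick sketch of how the connect string above is put together, the following Python snippet builds the same jdbc:mysql URL from its parts (the hostname here is hypothetical; substitute your database server's real hostname or IP):

```python
# Pieces of a JDBC connect string for MySQL.
# "node01.example.com" is a hypothetical hostname standing in for a real
# database server reachable from every worker node (not localhost).
host = "node01.example.com"
port = 3306
database = "mysql"

connect_string = f"jdbc:mysql://{host}:{port}/{database}"
print(connect_string)  # jdbc:mysql://node01.example.com:3306/mysql
```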
The import tool runs a MapReduce job that connects to the MySQL database and reads data from the table. Here we specified '-m 1', which means only one map task is used, so a single output file is created in the HDFS directory /sqoop01. By default Sqoop uses 4 map tasks, which run in parallel to speed up the import. Also by default, Sqoop generates comma-delimited text files for the imported data; we can specify delimiters explicitly as well.
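The effect of -m can be sketched in plain Python. This is an illustrative approximation, not Sqoop's actual code: for an integer split column, Sqoop queries the minimum and maximum of the column and divides that range roughly evenly among the map tasks, so -m 1 yields a single range and hence a single output file:

```python
# Sketch of Sqoop-style range splitting over an integer split column
# (assumption: even division between MIN and MAX, as Sqoop's default
# split logic does; this is not Sqoop's real implementation).
def split_ranges(min_val, max_val, num_mappers):
    """Return one inclusive (lo, hi) range per map task."""
    size = (max_val - min_val + 1) / num_mappers
    splits = []
    lo = min_val
    for i in range(num_mappers):
        # Last mapper always absorbs the remainder up to max_val.
        hi = max_val if i == num_mappers - 1 else int(min_val + size * (i + 1)) - 1
        splits.append((lo, hi))
        lo = hi + 1
    return splits

print(split_ranges(1, 4, 4))  # -m 4 over ids 1..4: [(1, 1), (2, 2), (3, 3), (4, 4)]
print(split_ranges(1, 4, 1))  # -m 1: one split, one part file: [(1, 4)]
```

With one range there is one map task and therefore exactly one part-m-00000 file, which matches the listing shown below.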
We can see the file’s contents as:
hdadmin@ubuntu:~/sqoop-1.4.5-cdh5.3.2$ hdfs dfs -ls /
Found 10 items
drwxr-xr-x   - hdadmin supergroup          0 2017-12-09 07:02 /data
drwxr-xr-x   - hdadmin supergroup          0 2017-12-11 00:20 /data1
drwxr-xr-x   - hdadmin supergroup          0 2017-12-09 22:35 /data2
drwxr-xr-x   - hdadmin supergroup          0 2017-12-09 22:56 /data3
drwxr-xr-x   - hdadmin supergroup          0 2017-12-13 05:05 /flume-00001
drwxr-xr-x   - hdadmin supergroup          0 2017-12-13 05:35 /flume-00002
drwxr-xr-x   - hdadmin supergroup          0 2017-12-11 00:01 /home
drwxr-xr-x   - hdadmin supergroup          0 2017-12-22 22:11 /sqoop01
drwxr-xr-x   - hdadmin supergroup          0 2017-12-10 00:06 /tmp
drwxr-xr-x   - hdadmin supergroup          0 2017-12-09 06:25 /user
hdadmin@ubuntu:~/sqoop-1.4.5-cdh5.3.2$ hdfs dfs -ls /sqoop01
Found 2 items
-rw-r--r--   1 hdadmin supergroup          0 2017-12-22 22:11 /sqoop01/_SUCCESS
-rw-r--r--   1 hdadmin supergroup         33 2017-12-22 22:11 /sqoop01/part-m-00000
hdadmin@ubuntu:~/sqoop-1.4.5-cdh5.3.2$ hdfs dfs -cat /sqoop01/part-m-00000
1,sai
2,kishore
3,chandu
4,gopal
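To make the format concrete, this small Python sketch parses the exact file contents shown above back into rows. The file contents come from the listing; the parsing code itself is only for illustration:

```python
# The imported file is plain text: one line per row, fields comma-separated.
# These are the exact bytes of /sqoop01/part-m-00000 shown above.
part_m_00000 = "1,sai\n2,kishore\n3,chandu\n4,gopal\n"

rows = []
for line in part_m_00000.strip().split("\n"):
    # Split only on the first comma: id, then the rest as the name field.
    id_str, name = line.split(",", 1)
    rows.append((int(id_str), name))

print(rows)  # [(1, 'sai'), (2, 'kishore'), (3, 'chandu'), (4, 'gopal')]
```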
The basic output format is a text file, but text files cannot hold binary fields (such as a VARBINARY column) and cannot distinguish between a null value and the string "null", so we often use SequenceFiles, Avro data files, or Parquet files instead. Avro and Parquet are portable formats that can be processed by many languages; SequenceFiles store individual records in custom record-specific data types. We can also compress the data using the '--compress' argument.
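The null-versus-string ambiguity can be demonstrated in a few lines of Python. This is a sketch, not Sqoop's serializer: once a SQL NULL and the literal string "null" are written as text, the two are indistinguishable on read-back, which is one reason the binary formats above exist:

```python
# Sketch of lossy text serialization (not Sqoop's actual code):
# a SQL NULL and the string "null" produce identical bytes on disk.
def to_text_field(value):
    # Text output substitutes a marker string for SQL NULL;
    # "null" is Sqoop's default marker.
    return "null" if value is None else str(value)

print(to_text_field(None))    # null
print(to_text_field("null"))  # null  <- indistinguishable after writing
```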
Some of the delimiter arguments we can use are:
- a single character (--fields-terminated-by X)
- an escape character (--fields-terminated-by \t). Supported escape characters are:
  - \b (backspace)
  - \n (newline)
  - \r (carriage return)
  - \t (tab)
  - \" (double-quote)
  - \' (single-quote)
  - \\ (backslash)
  - \0 (NUL): this will insert NUL characters between fields or lines, or will disable enclosing/escaping if used for one of the --enclosed-by, --optionally-enclosed-by, or --escaped-by arguments.
- the octal representation of a character's code point, in the form \0ooo, where ooo is the octal value. For example, --fields-terminated-by \001 would yield the ^A character.
- the hexadecimal representation of a character's code point, in the form \0xhhh, where hhh is the hex value. For example, --fields-terminated-by \0x10 would yield the carriage return character.
The default delimiters are a comma (,) for fields and a newline (\n) for records. If the imported records themselves contain these characters, parsing the output becomes ambiguous, so we should have Sqoop enclose and escape the fields, for example by using the '--mysql-delimiters' argument (which applies MySQL's default delimiter set, including escaping and optional enclosing).
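Why conflicting delimiters matter can be seen with a quick Python sketch: a field that itself contains a comma breaks a naive split, while enclosing the field keeps the record intact. Python's csv module here merely stands in for the kind of enclosing/escaping that --mysql-delimiters or --enclosed-by arrange on the Sqoop side:

```python
# Illustration only: a field containing the default field delimiter (a comma)
# corrupts a naive split, while quoting the field preserves it.
import csv
import io

name = "kishore, jr"                 # hypothetical value containing a comma
naive = f"2,{name}".split(",")       # wrong: yields 3 fields instead of 2

buf = io.StringIO()
csv.writer(buf).writerow([2, name])  # writer quotes the offending field
parsed = next(csv.reader(io.StringIO(buf.getvalue())))

print(len(naive))  # 3  <- record boundary corrupted
print(parsed)      # ['2', 'kishore, jr']  <- record preserved
```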