Embulk を使って CSV から MySQL にデータ転送する

データを用意する

https://github.com/livedoor/datasets

livedoor グルメの研究用データセットです。 2011年4月22日時点でのデータになります。

$ wget https://github.com/livedoor/datasets/raw/master/ldgourmet.tar.gz
$ tar xvfz ldgourmet.tar.gz
$ mkdir ldgourmet
$ mv *.csv ldgourmet
  • ちょっと古いですが20万件くらいの店舗データが用意されてる
  • LINE さんありがとう :pray:

embulk つかって データの転送

CSV => MYSQL

http://www.embulk.org/docs/

インスコ

$ curl --create-dirs -o ~/.embulk/bin/embulk -L "http://dl.embulk.org/embulk-latest.jar"
$ chmod +x ~/.embulk/bin/embulk
$ echo 'export PATH="$HOME/.embulk/bin:$PATH"' >> ~/.bashrc
$ source ~/.bashrc

確認

$ embulk --version
embulk 0.8.38

input, output のプラグイン追加 (のちに、MySQL から Elasticsearch にも転送してみるので関係するのもいっしょにいれておく)

$ embulk gem install embulk-input-mysql embulk-output-mysql embulk-output-elasticsearch

CSV から MySQL

embulk guess で config.yml ファイルを作成 (設定ファイルを自動生成)

in:
  type: file
  path_prefix: "./restaurants.csv"
out:
  type: stdout

これで、いい感じの yaml を作ってくれる
output 先を mysql の config に変えておく

$ embulk restaurants_seed.yml -o restaurants_config.yml
in:
  type: file
  path_prefix: ./restaurants.csv
  parser:
    charset: UTF-8
    newline: LF
    type: csv
    delimiter: ','
    quote: '"'
    escape: '"'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: long}
    - {name: name, type: string}
    - {name: property, type: string}
    - {name: alphabet, type: string}
    - {name: name_kana, type: string}
    - {name: pref_id, type: long}
    - {name: area_id, type: long}
    - {name: station_id1, type: long}
    - {name: station_time1, type: long}
    - {name: station_distance1, type: long}
    - {name: station_id2, type: long}
    - {name: station_time2, type: long}
    - {name: station_distance2, type: long}
    - {name: station_id3, type: long}
    - {name: station_time3, type: long}
    - {name: station_distance3, type: long}
    - {name: category_id1, type: long}
    - {name: category_id2, type: long}
    - {name: category_id3, type: long}
    - {name: category_id4, type: long}
    - {name: category_id5, type: long}
    - {name: zip, type: string}
    - {name: address, type: string}
    - {name: north_latitude, type: string}
    - {name: east_longitude, type: string}
    - {name: description, type: string}
    - {name: purpose, type: string}
    - {name: open_morning, type: long}
    - {name: open_lunch, type: long}
    - {name: open_late, type: long}
    - {name: photo_count, type: long}
    - {name: special_count, type: long}
    - {name: menu_count, type: long}
    - {name: fan_count, type: long}
    - {name: access_count, type: long}
    - {name: created_on, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: modified_on, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: closed, type: long}
out:
  # type: stdout
  type: mysql
  host: localhost
  user: root
  password: ''
  database: ldgourmet
  table: restaurants
  mode: replace

プレビュー

$ embulk preview -G restaurants_config.yml
2017-12-23 17:43:19.206 +0900: Embulk v0.8.38

********************************** INFORMATION **********************************
  Join us! Embulk-announce mailing list is up for IMPORTANT annoucement such as
    compatibility-breaking changes and key feature updates.
  https://groups.google.com/forum/#!forum/embulk-announce
*********************************************************************************

2017-12-23 17:43:21.641 +0900 [INFO] (0001:preview): Listing local files at directory '.' filtering filename by prefix 'restaurants.csv'
2017-12-23 17:43:21.643 +0900 [INFO] (0001:preview): "follow_symlinks" is set false. Note that symbolic links to directories are skipped.
2017-12-23 17:43:21.650 +0900 [INFO] (0001:preview): Loading files [restaurants.csv]
2017-12-23 17:43:21.656 +0900 [INFO] (0001:preview): Try to read 32,768 bytes from input source
*************************** 1 ***************************
               id (     long) : 2
             name (   string) : ラ・マーレ・ド・茶屋
         property (   string) : 2F・3F
         alphabet (   string) : LA MAREE DE CHAYA
        name_kana (   string) : らまーれどちゃや
          pref_id (     long) : 14
          area_id (     long) : 1,013
      station_id1 (     long) : 2,338
    station_time1 (     long) : 22
station_distance1 (     long) : 1,789
      station_id2 (     long) : 2,401
    station_time2 (     long) : 28
station_distance2 (     long) : 2,240
      station_id3 (     long) : 2,867
    station_time3 (     long) : 47
station_distance3 (     long) : 3,755
     category_id1 (     long) : 201
     category_id2 (     long) : 0
     category_id3 (     long) : 0
     category_id4 (     long) : 0
     category_id5 (     long) : 0
              zip (   string) : 240-0113
          address (   string) : 三浦郡葉山町堀内24-3
   north_latitude (   string) : 35.16.53.566
   east_longitude (   string) : 139.34.20.129
      description (   string) : こちら2.3Fのレストランへのコメントになります。  『ラ・マーレ・ド・茶屋』1F(テラス&バー)へのコメントはそちらにお願いします。    駐車場15台(専用)    06/06/19 営業時間等更新(From東京グルメ)
          purpose (   string) :
     open_morning (     long) : 0
       open_lunch (     long) : 1
        open_late (     long) : 0
      photo_count (     long) : 1
    special_count (     long) : 0
       menu_count (     long) : 0
        fan_count (     long) : 5
     access_count (     long) : 6,535
       created_on (timestamp) : 2000-09-10 11:22:02 UTC
      modified_on (timestamp) : 2011-04-22 16:05:12 UTC
           closed (     long) : 0
*************************** 2 ***************************

実行

$ embulk run restaurants_config.yml

確認

[ldgourmet] mysql> select count(*) from restaurants;
+----------+
| count(*) |
+----------+
|   214236 |
+----------+
1 row in set (0.10 sec)

desc

[ldgourmet] mysql> desc restaurants;
+-------------------+------------+------+-----+---------+-------+
| Field             | Type       | Null | Key | Default | Extra |
+-------------------+------------+------+-----+---------+-------+
| id                | bigint(20) | YES  |     | NULL    |       |
| name              | text       | YES  |     | NULL    |       |
| property          | text       | YES  |     | NULL    |       |
| alphabet          | text       | YES  |     | NULL    |       |
| name_kana         | text       | YES  |     | NULL    |       |
| pref_id           | bigint(20) | YES  |     | NULL    |       |
| area_id           | bigint(20) | YES  |     | NULL    |       |
| station_id1       | bigint(20) | YES  |     | NULL    |       |
| station_time1     | bigint(20) | YES  |     | NULL    |       |
| station_distance1 | bigint(20) | YES  |     | NULL    |       |
| station_id2       | bigint(20) | YES  |     | NULL    |       |
| station_time2     | bigint(20) | YES  |     | NULL    |       |
| station_distance2 | bigint(20) | YES  |     | NULL    |       |
| station_id3       | bigint(20) | YES  |     | NULL    |       |
| station_time3     | bigint(20) | YES  |     | NULL    |       |
| station_distance3 | bigint(20) | YES  |     | NULL    |       |
| category_id1      | bigint(20) | YES  |     | NULL    |       |
| category_id2      | bigint(20) | YES  |     | NULL    |       |
| category_id3      | bigint(20) | YES  |     | NULL    |       |
| category_id4      | bigint(20) | YES  |     | NULL    |       |
| category_id5      | bigint(20) | YES  |     | NULL    |       |
| zip               | text       | YES  |     | NULL    |       |
| address           | text       | YES  |     | NULL    |       |
| north_latitude    | text       | YES  |     | NULL    |       |
| east_longitude    | text       | YES  |     | NULL    |       |
| description       | text       | YES  |     | NULL    |       |
| purpose           | text       | YES  |     | NULL    |       |
| open_morning      | bigint(20) | YES  |     | NULL    |       |
| open_lunch        | bigint(20) | YES  |     | NULL    |       |
| open_late         | bigint(20) | YES  |     | NULL    |       |
| photo_count       | bigint(20) | YES  |     | NULL    |       |
| special_count     | bigint(20) | YES  |     | NULL    |       |
| menu_count        | bigint(20) | YES  |     | NULL    |       |
| fan_count         | bigint(20) | YES  |     | NULL    |       |
| access_count      | bigint(20) | YES  |     | NULL    |       |
| created_on        | timestamp  | YES  |     | NULL    |       |
| modified_on       | timestamp  | YES  |     | NULL    |       |
| closed            | bigint(20) | YES  |     | NULL    |       |
+-------------------+------------+------+-----+---------+-------+
38 rows in set (0.00 sec)
[ldgourmet] mysql> select * from restaurants limit 1\G
*************************** 1. row ***************************
               id: 2
             name: ラ・マーレ・ド・茶屋
         property: 2F・3F
         alphabet: LA MAREE DE CHAYA
        name_kana: らまーれどちゃや
          pref_id: 14
          area_id: 1013
      station_id1: 2338
    station_time1: 22
station_distance1: 1789
      station_id2: 2401
    station_time2: 28
station_distance2: 2240
      station_id3: 2867
    station_time3: 47
station_distance3: 3755
     category_id1: 201
     category_id2: 0
     category_id3: 0
     category_id4: 0
     category_id5: 0
              zip: 240-0113
          address: 三浦郡葉山町堀内24-3
   north_latitude: 35.16.53.566
   east_longitude: 139.34.20.129
      description: こちら2.3Fのレストランへのコメントになります。  『ラ・マーレ・ド・茶屋』1F(テラス&バー)へのコメントはそちらにお願いします。    駐車場15台(専用)    06/06/19 営業時間等更新(From東京グルメ)
          purpose: NULL
     open_morning: 0
       open_lunch: 1
        open_late: 0
      photo_count: 1
    special_count: 0
       menu_count: 0
        fan_count: 5
     access_count: 6535
       created_on: 2000-09-10 20:22:02
      modified_on: 2011-04-23 01:05:12
           closed: 0
1 row in set (0.00 sec)

20万レコード、15秒位でインサートできるのすごい!! 必要なテーブル、この作業をしとく。

つぎは、Elasticsearch

MySQL => Elasticsearch に転送する