Embulk を使って CSV から MySQL にデータ転送する
データを用意する
https://github.com/livedoor/datasets
$ wget https://github.com/livedoor/datasets/raw/master/ldgourmet.tar.gz $ tar xvfz ldgourmet.tar.gz $ mkdir ldgourmet $ mv *.csv ldgourmet
- ちょっと古いですが20万件くらいの店舗データが用意されてる
- LINE さんありがとう :pray:
embulk つかって データの転送
$ curl --create-dirs -o ~/.embulk/bin/embulk -L "http://dl.embulk.org/embulk-latest.jar" $ chmod +x ~/.embulk/bin/embulk $ echo 'export PATH="$HOME/.embulk/bin:$PATH"' >> ~/.bashrc $ source ~/.bashrc
確認
$ embulk --version embulk 0.8.38
input, output のプラグイン追加 (のちに、MySQL から Elasticsearch にも転送してみるので関係するのもいっしょにいれておく)
$ embulk gem install embulk-input-mysql embulk-output-mysql embulk-output-elasticsearch
CSV から MySQL へ
embulk guess で config.yml ファイルを作成 (設定ファイルを自動生成)
in: type: file path_prefix: "./restaurants.csv" out: type: stdout
これで、いい感じの yaml を作ってくれる
output 先を mysql の config に変えておく
$ embulk restaurants_seed.yml -o restaurants_config.yml
in: type: file path_prefix: ./restaurants.csv parser: charset: UTF-8 newline: LF type: csv delimiter: ',' quote: '"' escape: '"' trim_if_not_quoted: false skip_header_lines: 1 allow_extra_columns: false allow_optional_columns: false columns: - {name: id, type: long} - {name: name, type: string} - {name: property, type: string} - {name: alphabet, type: string} - {name: name_kana, type: string} - {name: pref_id, type: long} - {name: area_id, type: long} - {name: station_id1, type: long} - {name: station_time1, type: long} - {name: station_distance1, type: long} - {name: station_id2, type: long} - {name: station_time2, type: long} - {name: station_distance2, type: long} - {name: station_id3, type: long} - {name: station_time3, type: long} - {name: station_distance3, type: long} - {name: category_id1, type: long} - {name: category_id2, type: long} - {name: category_id3, type: long} - {name: category_id4, type: long} - {name: category_id5, type: long} - {name: zip, type: string} - {name: address, type: string} - {name: north_latitude, type: string} - {name: east_longitude, type: string} - {name: description, type: string} - {name: purpose, type: string} - {name: open_morning, type: long} - {name: open_lunch, type: long} - {name: open_late, type: long} - {name: photo_count, type: long} - {name: special_count, type: long} - {name: menu_count, type: long} - {name: fan_count, type: long} - {name: access_count, type: long} - {name: created_on, type: timestamp, format: '%Y-%m-%d %H:%M:%S'} - {name: modified_on, type: timestamp, format: '%Y-%m-%d %H:%M:%S'} - {name: closed, type: long} out: # type: stdout type: mysql host: localhost user: root password: '' database: ldgourmet table: restaurants mode: replace
プレビュー
$ embulk preview -G restaurants_config.yml 2017-12-23 17:43:19.206 +0900: Embulk v0.8.38 ********************************** INFORMATION ********************************** Join us! Embulk-announce mailing list is up for IMPORTANT annoucement such as compatibility-breaking changes and key feature updates. https://groups.google.com/forum/#!forum/embulk-announce ********************************************************************************* 2017-12-23 17:43:21.641 +0900 [INFO] (0001:preview): Listing local files at directory '.' filtering filename by prefix 'restaurants.csv' 2017-12-23 17:43:21.643 +0900 [INFO] (0001:preview): "follow_symlinks" is set false. Note that symbolic links to directories are skipped. 2017-12-23 17:43:21.650 +0900 [INFO] (0001:preview): Loading files [restaurants.csv] 2017-12-23 17:43:21.656 +0900 [INFO] (0001:preview): Try to read 32,768 bytes from input source *************************** 1 *************************** id ( long) : 2 name ( string) : ラ・マーレ・ド・茶屋 property ( string) : 2F・3F alphabet ( string) : LA MAREE DE CHAYA name_kana ( string) : らまーれどちゃや pref_id ( long) : 14 area_id ( long) : 1,013 station_id1 ( long) : 2,338 station_time1 ( long) : 22 station_distance1 ( long) : 1,789 station_id2 ( long) : 2,401 station_time2 ( long) : 28 station_distance2 ( long) : 2,240 station_id3 ( long) : 2,867 station_time3 ( long) : 47 station_distance3 ( long) : 3,755 category_id1 ( long) : 201 category_id2 ( long) : 0 category_id3 ( long) : 0 category_id4 ( long) : 0 category_id5 ( long) : 0 zip ( string) : 240-0113 address ( string) : 三浦郡葉山町堀内24-3 north_latitude ( string) : 35.16.53.566 east_longitude ( string) : 139.34.20.129 description ( string) : こちら2.3Fのレストランへのコメントになります。 『ラ・マーレ・ド・茶屋』1F(テラス&バー)へのコメントはそちらにお願いします。 駐車場15台(専用) 06/06/19 営業時間等更新(From東京グルメ) purpose ( string) : open_morning ( long) : 0 open_lunch ( long) : 1 open_late ( long) : 0 photo_count ( long) : 1 special_count ( long) : 0 menu_count ( long) : 0 fan_count ( long) : 5 access_count ( long) : 6,535 created_on (timestamp) : 2000-09-10 11:22:02 UTC modified_on (timestamp) : 2011-04-22 16:05:12 UTC closed ( long) : 0 *************************** 2 ***************************
実行
$ embulk run restaurants_config.yml
確認
[ldgourmet] mysql> select count(*) from restaurants; +----------+ | count(*) | +----------+ | 214236 | +----------+ 1 row in set (0.10 sec)
desc
[ldgourmet] mysql> desc restaurants; +-------------------+------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-------------------+------------+------+-----+---------+-------+ | id | bigint(20) | YES | | NULL | | | name | text | YES | | NULL | | | property | text | YES | | NULL | | | alphabet | text | YES | | NULL | | | name_kana | text | YES | | NULL | | | pref_id | bigint(20) | YES | | NULL | | | area_id | bigint(20) | YES | | NULL | | | station_id1 | bigint(20) | YES | | NULL | | | station_time1 | bigint(20) | YES | | NULL | | | station_distance1 | bigint(20) | YES | | NULL | | | station_id2 | bigint(20) | YES | | NULL | | | station_time2 | bigint(20) | YES | | NULL | | | station_distance2 | bigint(20) | YES | | NULL | | | station_id3 | bigint(20) | YES | | NULL | | | station_time3 | bigint(20) | YES | | NULL | | | station_distance3 | bigint(20) | YES | | NULL | | | category_id1 | bigint(20) | YES | | NULL | | | category_id2 | bigint(20) | YES | | NULL | | | category_id3 | bigint(20) | YES | | NULL | | | category_id4 | bigint(20) | YES | | NULL | | | category_id5 | bigint(20) | YES | | NULL | | | zip | text | YES | | NULL | | | address | text | YES | | NULL | | | north_latitude | text | YES | | NULL | | | east_longitude | text | YES | | NULL | | | description | text | YES | | NULL | | | purpose | text | YES | | NULL | | | open_morning | bigint(20) | YES | | NULL | | | open_lunch | bigint(20) | YES | | NULL | | | open_late | bigint(20) | YES | | NULL | | | photo_count | bigint(20) | YES | | NULL | | | special_count | bigint(20) | YES | | NULL | | | menu_count | bigint(20) | YES | | NULL | | | fan_count | bigint(20) | YES | | NULL | | | access_count | bigint(20) | YES | | NULL | | | created_on | timestamp | YES | | NULL | | | modified_on | timestamp | YES | | NULL | | | closed | bigint(20) | YES | | NULL | | +-------------------+------------+------+-----+---------+-------+ 38 rows in set (0.00 sec)
[ldgourmet] mysql> select * from restaurants limit 1\G *************************** 1. row *************************** id: 2 name: ラ・マーレ・ド・茶屋 property: 2F・3F alphabet: LA MAREE DE CHAYA name_kana: らまーれどちゃや pref_id: 14 area_id: 1013 station_id1: 2338 station_time1: 22 station_distance1: 1789 station_id2: 2401 station_time2: 28 station_distance2: 2240 station_id3: 2867 station_time3: 47 station_distance3: 3755 category_id1: 201 category_id2: 0 category_id3: 0 category_id4: 0 category_id5: 0 zip: 240-0113 address: 三浦郡葉山町堀内24-3 north_latitude: 35.16.53.566 east_longitude: 139.34.20.129 description: こちら2.3Fのレストランへのコメントになります。 『ラ・マーレ・ド・茶屋』1F(テラス&バー)へのコメントはそちらにお願いします。 駐車場15台(専用) 06/06/19 営業時間等更新(From東京グルメ) purpose: NULL open_morning: 0 open_lunch: 1 open_late: 0 photo_count: 1 special_count: 0 menu_count: 0 fan_count: 5 access_count: 6535 created_on: 2000-09-10 20:22:02 modified_on: 2011-04-23 01:05:12 closed: 0 1 row in set (0.00 sec)
20万レコード、15秒位でインサートできるのすごい!! 必要なテーブル、この作業をしとく。
つぎは、Elasticsearch
MySQL => Elasticsearch に転送する