oi

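Sunday, March 12, 2017

[Follow-up] Switching to a Popular Budget SIM (iPhone SE 64GB SIM-Free + FREETEL SIM for iPhone)

In June 2016, in the post "Switching to a Popular Budget SIM (iPhone SE 64GB SIM-Free + FREETEL SIM for iPhone)", I described how to switch to a budget SIM and simulated the cost of doing so.

This time, I reset the assumed monthly charge based on my actual FREETEL usage from June 2016 through February 2017 and present the updated cost simulation.

Actual charges

First, the graph below shows my actual FREETEL charges from June 2016 through February 2017.

For the six months from July through December, FREETEL's "Up to 1 Year for 0 Yen" campaign deducted the charge for 1 GB of data (499 yen) each month.

Adding the 499-yen discount back to those months and averaging over the nine months gives roughly 2,430 yen per month.

Breakdown of charges

Next, the table below shows the statement for December 2016. In the previous post, I based the cost calculation on the data charge alone (listed as the basic usage fee in the table). In practice, however, there were many occasions when free calling apps such as LINE and email were not enough to reach someone, so I also incurred voice call and SMS charges.

As for data, I used less than 3 GB in seven of the nine months. I attribute this to the fact that I do not use much mobile data in the first place and that Wi-Fi coverage has improved.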

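Statement
Item	Billing month	Tax	Amount
◆ Basic usage fee
Pay-for-what-you-use plan (docomo network) for iPhone, with voice calls	2016/12	Tax-exclusive	¥1,600
"Up to 1 Year for 0 Yen" campaign, round 5 (6 months)	2016/12	Tax-exclusive	-¥499
◆ Call charges
Domestic voice calls	2016/11	Tax-exclusive	¥580
Domestic SMS	2016/11	Tax-exclusive	¥12
◆ Other
Universal service fee (3 yen charged per number)	2016/12	Tax-exclusive	¥3
Subtotal (taxable)			¥1,696
Consumption tax			¥135
Subtotal (tax-inclusive/non-taxable)			¥0
Total			¥1,831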
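
The totals on the statement can be re-derived from its line items. The sketch below is only a sanity check: the 8% rate is Japan's consumption tax rate at the time, and dropping fractions of a yen is an assumption inferred from the printed figures, not something the statement says.

```python
# Line items from the December 2016 statement (yen, before tax).
line_items = {
    "basic usage fee": 1600,
    "campaign discount": -499,
    "domestic voice calls": 580,
    "domestic SMS": 12,
    "universal service fee": 3,
}

TAX_RATE_PERCENT = 8  # consumption tax rate in Japan in 2016

subtotal = sum(line_items.values())
# Assumption: fractions of a yen are dropped (integer division).
tax = subtotal * TAX_RATE_PERCENT // 100
total = subtotal + tax

print(subtotal, tax, total)
```

Under these assumptions the computed values match the printed subtotal, tax, and total.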

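Cost simulation based on actual usage

Finally, the graph below shows a simulation of future cumulative charges based on my actual usage.

For comparison, the graph also plots the cumulative charges I would have paid had I stayed with SoftBank (8,800 yen per month). For FREETEL's initial cost, see the post "Switching to a Popular Budget SIM (iPhone SE 64GB SIM-Free + FREETEL SIM for iPhone)".

As the graph shows, the two curves cross in May 2017, which means the initial cost is recovered after about one year of use. If I keep using the SIM for two years in total (until May 2018), the second year alone saves a further 76,440 yen ((8,800 yen - 2,430 yen) × 12 months).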
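
The arithmetic behind the simulation can be sketched as follows. The two monthly charges come from this post; the initial cost is not restated here, so `HYPOTHETICAL_INITIAL_COST` is a placeholder chosen purely for illustration, and the function names are my own.

```python
SOFTBANK_MONTHLY = 8800  # yen per month, from this post
FREETEL_MONTHLY = 2430   # yen per month, nine-month average from this post


def cumulative_costs(months, initial_cost):
    """Return (softbank, freetel) cumulative totals after each month."""
    softbank = [SOFTBANK_MONTHLY * m for m in range(1, months + 1)]
    freetel = [initial_cost + FREETEL_MONTHLY * m for m in range(1, months + 1)]
    return softbank, freetel


def break_even_month(initial_cost):
    """First month at which FREETEL's cumulative cost is no higher."""
    monthly_saving = SOFTBANK_MONTHLY - FREETEL_MONTHLY
    return -(-initial_cost // monthly_saving)  # ceiling division


def savings_after(months):
    """Savings accumulated over `months` once the initial cost is recovered."""
    return (SOFTBANK_MONTHLY - FREETEL_MONTHLY) * months


# Hypothetical value, NOT taken from the original posts.
HYPOTHETICAL_INITIAL_COST = 70000

print(break_even_month(HYPOTHETICAL_INITIAL_COST))
print(savings_after(12))
```

`savings_after(12)` reproduces the 76,440 yen quoted above. The break-even month depends entirely on the initial cost that is plugged in.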

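Summary

Having switched to a budget SIM, I found that the actual figures also confirm a large cost reduction compared with a major carrier. I hope this post encourages readers who feel that budget SIMs seem "somehow shady" or "somehow a hassle to migrate to" to take the first step.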

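Friday, February 3, 2017

Making My Own LINE Stickers (2)

I applied for my LINE stickers on Friday, January 6.
Making My Own LINE Stickers (1)

Although a notice said that review would take longer than usual because of the New Year holidays, the review finished exactly one week later, on Friday, January 13. The stickers were approved and went on sale.
The review-complete notification arrives via LINE, so I could follow the status promptly, which was very convenient.

The process was easier than I expected, so I decided to make another set of stickers.
This time the set has 40 stickers.

I have gradually gotten used to the illustration software and can now work efficiently.
Here are the finished illustrations. (Their order differs from that of the drafts.)

With 40 stickers, the set should be usable in everyday conversation.
I look forward to being able to use them around next week.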

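Friday, January 6, 2017

Making My Own LINE Stickers (1)

I had been curious about custom LINE stickers for some time.
Since even amateurs can reportedly create and sell original stickers with ease, I gave it a first try now that I have a little more free time.

The only tool I used to create the illustrations was the following free software.

Free paint tool (for both Mac and Windows): FireAlpaca

I followed the steps below.

① Brainstorming ideas

  I started by listing words that seemed useful in everyday conversation.

  A set used to require 42 stickers, but it can now be submitted with as few as 8, so this time I aimed for 16.

② Choosing a character

  This is normally a step worth being particular about, but as a complete novice at illustration I had to settle for a character I could draw easily.

  Since I had to draw at least 16 of them, I chose a cat, which is simple and easy to give expressions and movement.

③ Drafting

  With the character decided, I worked out compositions, expressions, and poses based on the ideas from ①. A simple pencil sketch on report paper is fine at this stage.

④ Creating the illustration data

  Import the draft sheets into the PC.

  Ideally they should be scanned, but getting the printer out was a hassle, so I photographed them with my iPhone and transferred the images, which was good enough. Based on the imported images, I did the clean line art, coloring, and export in FireAlpaca.

  LINE publishes guidelines for stickers, so take care to match the required image size.

  Here is the result!

⑤ Creator registration

  Enter the required information at LINE Creators Market and register.

⑥ Uploading the stickers

  Once registration is complete, stickers can be uploaded.

  The title, description, and so on must be entered first.

⑦ Requesting review

  Once everything is in place, request a review.

I have completed everything up to this point. If the stickers are approved, they will go on sale.

I will post a follow-up when there is progress.

Follow-up added (2017/2/3)
Making My Own LINE Stickers (2)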

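Sunday, December 4, 2016

The EM Algorithm Explained, for Those Afraid to Ask: The E-Step When the Latent Variable Is Continuous

For the case where $\boldsymbol{z}_i$ is a continuous variable, I also add a derivation of the probability distribution $Q(\boldsymbol{z}_i)$ by the calculus of variations.

That is, I show the following.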
$$
\begin{eqnarray}
Q(\boldsymbol{z}_i) &=& P( \boldsymbol{z}_i \mid \boldsymbol{x}_i, \boldsymbol{\theta})
\end{eqnarray}
$$

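The E-step (continuous latent variable)

In the E-step, with $\boldsymbol{\theta}$ held fixed, we maximize the lower bound of the log-likelihood with respect to the distribution $Q(\boldsymbol{z}_i)$. Below, the log-likelihood is transformed to obtain this lower bound.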

$$
\begin{eqnarray}
\displaystyle \sum_{ i = 1 }^{ N } \ln P( \boldsymbol{x}_i \mid \boldsymbol{\theta}) &=& \displaystyle \sum_{ i = 1 }^{ N } \ln \displaystyle \int P( \boldsymbol{x}_i, \boldsymbol{z}_i \mid \boldsymbol{\theta}) d\boldsymbol{z}_i\\
&=& \displaystyle \sum_{ i = 1 }^{ N } \ln \displaystyle \int Q(\boldsymbol{z}_i)\frac{P( \boldsymbol{x}_i, \boldsymbol{z}_i \mid \boldsymbol{\theta})}{Q(\boldsymbol{z}_i)} d\boldsymbol{z}_i\\
&\geq& \displaystyle \sum_{ i = 1 }^{ N }\displaystyle \int Q(\boldsymbol{z}_i) \ln \frac{P( \boldsymbol{x}_i, \boldsymbol{z}_i \mid \boldsymbol{\theta})}{Q(\boldsymbol{z}_i)}d\boldsymbol{z}_i
\end{eqnarray}
$$

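Going from (2) to (3) merely multiplies and divides by an arbitrary probability distribution $Q(\boldsymbol{z}_i)$ over $\boldsymbol{z}_i$. Note that no assumption has been made about $Q(\boldsymbol{z}_i)$ at this point.
Going from (3) to (4) uses Jensen's inequality, which applies because $\ln$ is concave.

To maximize the lower bound (4) with $\boldsymbol{\theta}$ held fixed, we can use the calculus of variations. We regard the lower bound (4) as a functional of $Q(\boldsymbol{z}_i)$ (a function whose value changes when the form of its argument function changes; for an accessible explanation, see 「物理のかぎしっぽ（変分法１）」) and find its extremum by the calculus of variations.

In particular, writing $Q = Q(\boldsymbol{z}_i)$, consider the following simple functional determined by $\boldsymbol{z}_i$ and $Q$.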

$$
\begin{eqnarray}
\displaystyle \int f( \boldsymbol{z}_i, Q) d\boldsymbol{z}_i
\end{eqnarray}
$$

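Because the integrand of this simple functional does not depend on derivatives of $Q$, the Euler-Lagrange equation that gives its extremum reduces to the following. (For the Euler-Lagrange equations of other functionals, see 「物理のかぎしっぽ（変分法２）」.)

$$
\begin{eqnarray}
\frac{ \partial f }{ \partial Q } &=& 0
\end{eqnarray}
$$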

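Therefore, focusing on the integral in the lower bound (4) and adding the constraint $\int Q(\boldsymbol{z}_i) d\boldsymbol{z}_i =1$ with a Lagrange multiplier $\lambda$, the functional becomes the following.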

$$
\begin{eqnarray}
\displaystyle \int Q(\boldsymbol{z}_i) \ln \frac{P( \boldsymbol{x}_i, \boldsymbol{z}_i \mid \boldsymbol{\theta})}{Q(\boldsymbol{z}_i)}d\boldsymbol{z}_i - \lambda (1- \int Q(\boldsymbol{z}_i) d\boldsymbol{z}_i )
\end{eqnarray}
$$

Taking the variation with respect to $Q$ gives the following.
$$
\begin{eqnarray}
\ln \frac{P( \boldsymbol{x}_i, \boldsymbol{z}_i \mid \boldsymbol{\theta})}{Q(\boldsymbol{z}_i)}+Q(\boldsymbol{z}_i) \cdot \frac{Q(\boldsymbol{z}_i)}{P( \boldsymbol{x}_i, \boldsymbol{z}_i \mid \boldsymbol{\theta})} \cdot \left[- \frac{P( \boldsymbol{x}_i, \boldsymbol{z}_i \mid \boldsymbol{\theta})}{Q(\boldsymbol{z}_i)^2} \right] + \lambda = 0
\end{eqnarray}
$$
$$
\begin{eqnarray}
\ln \frac{P( \boldsymbol{x}_i, \boldsymbol{z}_i \mid \boldsymbol{\theta})}{Q(\boldsymbol{z}_i)} = -\lambda + 1
\end{eqnarray}
$$
$$
\begin{eqnarray}
Q(\boldsymbol{z}_i) = e^{\lambda - 1}P( \boldsymbol{x}_i, \boldsymbol{z}_i \mid \boldsymbol{\theta})
\end{eqnarray}
$$

Using $\int Q(\boldsymbol{z}_i) d\boldsymbol{z}_i =1$ to determine the constant $e^{\lambda - 1}$, we obtain the following.

$$
\begin{eqnarray}
Q(\boldsymbol{z}_i) &=& \frac{P( \boldsymbol{x}_i, \boldsymbol{z}_i \mid \boldsymbol{\theta})}{\int P( \boldsymbol{x}_i, \boldsymbol{z}_i \mid \boldsymbol{\theta}) d\boldsymbol{z}_i}\\
&=& \frac{P( \boldsymbol{x}_i, \boldsymbol{z}_i \mid \boldsymbol{\theta})}{P( \boldsymbol{x}_i \mid \boldsymbol{\theta}) }\\
&=& P( \boldsymbol{z}_i \mid \boldsymbol{x}_i, \boldsymbol{\theta})
\end{eqnarray}
$$

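This agrees with equation (8) of the post "The EM Algorithm Explained, for Those Afraid to Ask" (「今更聞けないEMアルゴリズムの解説」), which explained the E-step of the EM algorithm under the assumption of a discrete distribution.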
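
The result can also be checked numerically. The following is a minimal sketch, not part of the original derivation: it assumes a toy model $z \sim N(0, 1)$, $x \mid z \sim N(z, 1)$, replaces the integral over $z$ with a sum on a grid, and compares the lower bound for two choices of $Q$. All names and the toy model are illustrative assumptions.

```python
import math


def normal_pdf(v, mean, var):
    """Density of N(mean, var) at v."""
    return math.exp(-(v - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


# Toy model (assumption): z ~ N(0, 1), x | z ~ N(z, 1), one observation x.
x = 1.5
dz = 0.01
grid = [-10.0 + dz * k for k in range(2001)]

# P(x, z | theta) on the grid, and P(x | theta) as its integral over z.
joint = [normal_pdf(z, 0.0, 1.0) * normal_pdf(x, z, 1.0) for z in grid]
evidence = sum(joint) * dz


def lower_bound(q):
    """Discretized version of the integral of Q ln(P(x, z) / Q) over z."""
    return sum(qk * math.log(pk / qk) for qk, pk in zip(q, joint) if qk > 0.0) * dz


# Q chosen as the posterior P(z | x, theta), as derived above.
posterior = [pk / evidence for pk in joint]

# Another valid distribution for comparison: the prior, renormalized on the grid.
prior_raw = [normal_pdf(z, 0.0, 1.0) for z in grid]
prior_mass = sum(prior_raw) * dz
prior = [p / prior_mass for p in prior_raw]

print(math.log(evidence), lower_bound(posterior), lower_bound(prior))
```

With $Q$ set to the posterior, the lower bound coincides with $\ln P(x \mid \boldsymbol{\theta})$, while the prior gives a strictly smaller value, consistent with the derivation.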

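Tuesday, November 15, 2016

Building a Chatbot with the Much-Talked-About TensorFlow and LINE Talk History (5)

Case study

The preamble has been long, but here I use the source code developed so far to train the "chatbot" and show the results of running it.

Training data

For training, I used the following LINE talk data from the everyday conversations I have with my family.

All LINE talk data: 17,796 pairs (in/out)
train data: 16,017 pairs (in/out) # 90% of all data
dev data: 1,779 pairs (in/out) # 10% of all data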
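
A 9:1 split that reproduces these counts can be sketched as below. The post does not say whether the pairs were shuffled or split in order, so the sequential split, the function name `split_pairs`, and the dummy data are all assumptions for illustration.

```python
def split_pairs(pairs, dev_denominator=10):
    """Split (input, output) pairs into train and dev sets.

    The dev set receives len(pairs) // dev_denominator pairs taken from the
    end of the list (an assumption), and the rest go to the train set.
    """
    dev_size = len(pairs) // dev_denominator
    train_size = len(pairs) - dev_size
    return pairs[:train_size], pairs[train_size:]


# Dummy pairs standing in for the real (in, out) talk data.
pairs = [("in %d" % i, "out %d" % i) for i in range(17796)]
train, dev = split_pairs(pairs)

print(len(train), len(dev))
```

Each set would then be written to a pair of line-aligned files (`.in` and `.out`), so that the n-th line of one file corresponds to the n-th line of the other.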

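Directory layout (at the start of training)

At the start of training, place the following three Python files and two directories at the same level. The line_talk_data directory holds the four files created as training data. (After training, intermediate files are generated in each directory.)

chatbot.py
data_utils.py
seq2seq_model.py
line_talk_data # directory for the training data
    line_talk_train.out
    line_talk_train.in
    line_talk_dev.out
    line_talk_dev.in
line_talk_train # directory for the checkpoint data produced by training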

Running training

python chatbot.py

The console output after launch is shown below.

chatbot$ python chatbot.py 
Preparing LINE talk data in line_talk_data
Creating vocabulary line_talk_data/vocab40000.out from data line_talk_data/line_talk_train.out
Creating vocabulary line_talk_data/vocab40000.in from data line_talk_data/line_talk_train.in
Tokenizing data in line_talk_data/line_talk_train.out
Tokenizing data in line_talk_data/line_talk_train.in
Tokenizing data in line_talk_data/line_talk_dev.out
Tokenizing data in line_talk_data/line_talk_dev.in
Creating 3 layers of 256 units.
Created model with fresh parameters.
Reading development and training data (limit: 0).
global step 100 learning rate 0.5000 step-time 0.66 perplexity 8820.34
  eval: bucket 0 perplexity 3683.12
  eval: bucket 1 perplexity 4728.98
  eval: bucket 2 perplexity 4118.81
  eval: bucket 3 perplexity 5504.88
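
The "bucket 0" to "bucket 3" entries in this log refer to the `_buckets` list in chatbot.py. As a rough sketch of how `read_data()` assigns a pair to a bucket, the standalone function below mirrors that logic; `assign_bucket` is a name introduced here for illustration and is not part of the original source.

```python
_buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]  # same values as chatbot.py
EOS_ID = 2  # same value as data_utils.EOS_ID


def assign_bucket(source_ids, target_ids):
    """Return the index of the first bucket that fits the pair, or None.

    As in read_data(), an EOS token is appended to the target before the
    length check, and both lengths must be strictly smaller than the
    bucket sizes.
    """
    target_with_eos = list(target_ids) + [EOS_ID]
    for bucket_id, (source_size, target_size) in enumerate(_buckets):
        if len(source_ids) < source_size and len(target_with_eos) < target_size:
            return bucket_id
    return None  # too long for every bucket; read_data() drops such pairs


print(assign_bucket([4] * 3, [5] * 3))
print(assign_bucket([4] * 6, [5] * 7))
print(assign_bucket([4] * 50, [5]))
```

Short exchanges therefore land in bucket 0, longer ones in higher buckets, and the evaluation perplexity is reported separately for each.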

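Training is stopped once it has converged to some extent, as in the log further below. This time I trained for about 8 hours on a MacBook Pro with the following specs.

CPU: 2.9 GHz Intel Core i5
Memory: 16 GB 1867 MHz DDR3

At this point, .ckpt files corresponding to the training steps have accumulated in the line_talk_train directory. When the chatbot is run in dialogue mode, the latest of these .ckpt files is restored.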

global step 17000 learning rate 0.3812 step-time 0.49 perplexity 30.66
  eval: bucket 0 perplexity 29.40
  eval: bucket 1 perplexity 45.61
  eval: bucket 2 perplexity 44.65
  eval: bucket 3 perplexity 85.85
global step 17100 learning rate 0.3812 step-time 0.51 perplexity 30.04
  eval: bucket 0 perplexity 113.59
  eval: bucket 1 perplexity 55.54
  eval: bucket 2 perplexity 39.45
  eval: bucket 3 perplexity 45.94
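
The perplexity values in these logs are the exponential of the average loss, as computed in `train()` of chatbot.py. The sketch below restates that formula and inverts it to recover the loss behind a logged perplexity; `loss_from_perplexity` is a helper introduced here for illustration only.

```python
import math


def perplexity(loss):
    """Same formula as in train(): exp(loss), capped to inf for huge losses."""
    return math.exp(float(loss)) if loss < 300 else float("inf")


def loss_from_perplexity(ppx):
    """Inverse of perplexity() for finite values."""
    return math.log(ppx)


# Training perplexities logged at global step 100 and global step 17000.
for logged in (8820.34, 30.66):
    print(logged, round(loss_from_perplexity(logged), 2))
```

By this formula, the drop in training perplexity from 8820.34 to 30.66 corresponds to the average loss falling from roughly 9.08 to roughly 3.42.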

Running the dialogue (--decode)

python chatbot.py --decode

Running with the --decode option starts dialogue mode. The console below confirms that the latest .ckpt file has been restored and that the program has entered dialogue mode.

chatbot$ python chatbot.py --decode
Reading model parameters from line_talk_train/chatbot.ckpt-17600
>

A sample dialogue is shown below. Lines tagged ([Morpho]: ...) print the result of the morphological analysis of the input.

> 今日は何時に帰る？
([Morpho]:今日 は 何 時 に 帰る ？)
00 : 00 くらい か なー ！
> 今から帰りまーす
([Morpho]:今 から 帰り ま ー す)
今 から ます ！
> おつかれー
([Morpho]:お つかれ ー)
[ スタンプ ]
> 退社しましたー！
([Morpho]:退社 し まし た ー ！)
00 : 00 くらい か なー ！
> お腹空いたのでゆっくり食べてます〜
([Morpho]:お腹 空い た ので ゆっくり 食べ て ます 〜)
たいしゃ

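The bot manages a passable conversation.
In response to the question 「今日は何時に帰る？」 ("What time will you be home today?"), it tries to reply with a time. The reply pattern has been learned well, although, regrettably, the bot has no actual information and so cannot return the correct time.
For the message 「ゆっくり食べてます〜」 ("I'm taking my time eating", that is, I have started dinner without you), the reply 「たいしゃ」 ("left the office"), sent by someone leaving work late, also reflects an exchange that happens often in daily life, so it can be said to have been learned well.
In addition, because only the text of the LINE talk history is used for training, it is unavoidable that some replies come out as [スタンプ], the placeholder that stands for a sticker.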
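
One likely contributor to the "00 : 00" replies is the digit normalization in data_utils.py: `create_vocabulary()` and `sentence_to_token_ids()` are called with the default `normalize_digits=True`, which replaces every digit with 0 before a token is added to or looked up in the vocabulary. The sketch below reproduces only that substitution on str tokens; the function name and sample tokens are illustrative.

```python
import re

# Same pattern as _DIGIT_RE in data_utils.py, applied to str instead of bytes.
_DIGIT_RE = re.compile(r"\d")


def normalize_digits(token):
    """Replace every digit in a token with 0, as data_utils.py does."""
    return _DIGIT_RE.sub("0", token)


# A tokenized time such as "19 : 30" collapses to "00 : 00".
tokens = ["19", ":", "30"]
normalized = " ".join(normalize_digits(t) for t in tokens)
print(normalized)
```

Under this normalization, every two-digit time in the training data maps to the same token, so the model can learn that a time belongs in the reply but not which time.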

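Summary

Having built a "chatbot" from the text of my LINE talk history, I found that it learns how an individual replies on LINE and produces plausible responses. Unlike translation, the criteria for "good" and "bad" are extremely vague, which makes evaluation difficult, but the results suggest real potential.

Going forward, I would like to refine the training data, for example by increasing the amount of data and by generating dialogue pairs that take chronological order into account. In addition, to reduce computation, the hyperparameters used this time are considerably smaller than the defaults. Tuning the hyperparameters as the data grows should lead to more natural dialogue.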

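References

Building a Chatbot with the Much-Talked-About TensorFlow and LINE Talk History (1)

Building a Chatbot with the Much-Talked-About TensorFlow and LINE Talk History (2)

Building a Chatbot with the Much-Talked-About TensorFlow and LINE Talk History (3)

Building a Chatbot with the Much-Talked-About TensorFlow and LINE Talk History (4)

Building a Chatbot with the Much-Talked-About TensorFlow and LINE Talk History (4)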

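2. Implementing the chatbot training logic

Starting from the following three source files under models/rnn, which accompany Reference 2 (TensorFlow: Sequence-to-Sequence Models), I made the modifications needed for the "chatbot". translate.py, which contains the main method, was modified and renamed chatbot.py.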

No.	src	description
1	translate/seq2seq_model.py	Neural translation sequence-to-sequence model.
2	translate/data_utils.py	Helper functions for preparing translation data.
3	translate/translate.py	Binary that trains and runs the translation model.

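For the training logic, the only changes made to the programs are the following.
2. data_utils.py: added a prepare_line_talk_data method that loads the training data (created from the LINE talk history)
3. chatbot.py: modified to call the prepare_line_talk_data method in data_utils.py

3. Implementing the chatbot dialogue logic

For the dialogue logic, the only change made to the programs is the following.
3. chatbot.py: added Japanese morphological analysis at decode time

With only these modifications, a sequence-to-sequence model built for English-to-French translation can be used as a "chatbot".

In the end, it runs simply by placing the three source files and the data directory at the same level.

The source code is shown below.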
[Source Code : data_utils.py]

# Copyright 2016 y-euda. All Rights Reserved.
# The following modifications are added based on tensorflow/models/rnn/translate/data_utils.py.
# - prepare_line_talk_data() is created to load LINE talk data text file.
#
#==============================================================================
# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

"""Utilities for downloading data from WMT, tokenizing, vocabularies."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import gzip
import os
import re

from tensorflow.python.platform import gfile
import tensorflow as tf

# Special vocabulary symbols - we always put them at the start.
_PAD = b"_PAD"
_GO = b"_GO"
_EOS = b"_EOS"
_UNK = b"_UNK"
_START_VOCAB = [_PAD, _GO, _EOS, _UNK]

PAD_ID = 0
GO_ID = 1
EOS_ID = 2
UNK_ID = 3

# Regular expressions used to tokenize.
_WORD_SPLIT = re.compile(b"([.,!?\"':;)(])")
_DIGIT_RE = re.compile(br"\d")

def gunzip_file(gz_path, new_path):
  """Unzips from gz_path into new_path."""
  print("Unpacking %s to %s" % (gz_path, new_path))
  with gzip.open(gz_path, "rb") as gz_file:
    with open(new_path, "wb") as new_file:
      for line in gz_file:
        new_file.write(line)

def basic_tokenizer(sentence):
  """Very basic tokenizer: split the sentence into a list of tokens."""
  words = []
  for space_separated_fragment in sentence.strip().split():
    words.extend(_WORD_SPLIT.split(space_separated_fragment))
  return [w for w in words if w]


def create_vocabulary(vocabulary_path, data_path, max_vocabulary_size,
                      tokenizer=None, normalize_digits=True):
  """Create vocabulary file (if it does not exist yet) from data file.

  Data file is assumed to contain one sentence per line. Each sentence is
  tokenized and digits are normalized (if normalize_digits is set).
  Vocabulary contains the most-frequent tokens up to max_vocabulary_size.
  We write it to vocabulary_path in a one-token-per-line format, so that later
  token in the first line gets id=0, second line gets id=1, and so on.

  Args:
    vocabulary_path: path where the vocabulary will be created.
    data_path: data file that will be used to create vocabulary.
    max_vocabulary_size: limit on the size of the created vocabulary.
    tokenizer: a function to use to tokenize each data sentence;
      if None, basic_tokenizer will be used.
    normalize_digits: Boolean; if true, all digits are replaced by 0s.
  """
  if not gfile.Exists(vocabulary_path):
    print("Creating vocabulary %s from data %s" % (vocabulary_path, data_path))
    vocab = {}
    with gfile.GFile(data_path, mode="rb") as f:
      counter = 0
      for line in f:
        counter += 1
        if counter % 100000 == 0:
          print("  processing line %d" % counter)
        line = tf.compat.as_bytes(line)
        tokens = tokenizer(line) if tokenizer else basic_tokenizer(line)
        for w in tokens:
          word = _DIGIT_RE.sub(b"0", w) if normalize_digits else w
          if word in vocab:
            vocab[word] += 1
          else:
            vocab[word] = 1
      vocab_list = _START_VOCAB + sorted(vocab, key=vocab.get, reverse=True)
      if len(vocab_list) > max_vocabulary_size:
        vocab_list = vocab_list[:max_vocabulary_size]
      with gfile.GFile(vocabulary_path, mode="wb") as vocab_file:
        for w in vocab_list:
          vocab_file.write(w + b"\n")

def initialize_vocabulary(vocabulary_path):
  """Initialize vocabulary from file.

  We assume the vocabulary is stored one-item-per-line, so a file:
    dog
    cat
  will result in a vocabulary {"dog": 0, "cat": 1}, and this function will
  also return the reversed-vocabulary ["dog", "cat"].

  Args:
    vocabulary_path: path to the file containing the vocabulary.

  Returns:
    a pair: the vocabulary (a dictionary mapping string to integers), and
    the reversed vocabulary (a list, which reverses the vocabulary mapping).

  Raises:
    ValueError: if the provided vocabulary_path does not exist.
  """
  if gfile.Exists(vocabulary_path):
    rev_vocab = []
    with gfile.GFile(vocabulary_path, mode="rb") as f:
      rev_vocab.extend(f.readlines())
    rev_vocab = [line.strip() for line in rev_vocab]
    vocab = dict([(x, y) for (y, x) in enumerate(rev_vocab)])
    return vocab, rev_vocab
  else:
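    # Use % so the path is interpolated into the message instead of being
    # passed as a second exception argument.
    raise ValueError("Vocabulary file %s not found." % vocabulary_path)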


def sentence_to_token_ids(sentence, vocabulary,
                          tokenizer=None, normalize_digits=True):
  """Convert a string to list of integers representing token-ids.

  For example, a sentence "I have a dog" may become tokenized into
  ["I", "have", "a", "dog"] and with vocabulary {"I": 1, "have": 2,
  "a": 4, "dog": 7} this function will return [1, 2, 4, 7].

  Args:
    sentence: the sentence in bytes format to convert to token-ids.
    vocabulary: a dictionary mapping tokens to integers.
    tokenizer: a function to use to tokenize each sentence;
      if None, basic_tokenizer will be used.
    normalize_digits: Boolean; if true, all digits are replaced by 0s.

  Returns:
    a list of integers, the token-ids for the sentence.
  """

  if tokenizer:
    words = tokenizer(sentence)
  else:
    words = basic_tokenizer(sentence)
  if not normalize_digits:
    return [vocabulary.get(w, UNK_ID) for w in words]
  # Normalize digits by 0 before looking words up in the vocabulary.
  return [vocabulary.get(_DIGIT_RE.sub(b"0", w), UNK_ID) for w in words]


def data_to_token_ids(data_path, target_path, vocabulary_path,
                      tokenizer=None, normalize_digits=True):
  """Tokenize data file and turn into token-ids using given vocabulary file.

  This function loads data line-by-line from data_path, calls the above
  sentence_to_token_ids, and saves the result to target_path. See comment
  for sentence_to_token_ids on the details of token-ids format.

  Args:
    data_path: path to the data file in one-sentence-per-line format.
    target_path: path where the file with token-ids will be created.
    vocabulary_path: path to the vocabulary file.
    tokenizer: a function to use to tokenize each sentence;
      if None, basic_tokenizer will be used.
    normalize_digits: Boolean; if true, all digits are replaced by 0s.
  """
  if not gfile.Exists(target_path):
    print("Tokenizing data in %s" % data_path)
    vocab, _ = initialize_vocabulary(vocabulary_path)
    with gfile.GFile(data_path, mode="rb") as data_file:
      with gfile.GFile(target_path, mode="w") as tokens_file:
        counter = 0
        for line in data_file:
          counter += 1
          if counter % 100000 == 0:
            print("  tokenizing line %d" % counter)
          token_ids = sentence_to_token_ids(line, vocab, tokenizer,
                                            normalize_digits)
          tokens_file.write(" ".join([str(tok) for tok in token_ids]) + "\n")

def prepare_line_talk_data(data_dir, in_vocabulary_size, out_vocabulary_size, tokenizer=None):
  """Get line talk data into data_dir, create vocabularies and tokenize data.

  Args:
    data_dir: directory in which the data sets will be stored.
    in_vocabulary_size: size of the Input vocabulary to create and use.
    out_vocabulary_size: size of the Output vocabulary to create and use.
    tokenizer: a function to use to tokenize each data sentence;
      if None, basic_tokenizer will be used.

  Returns:
    A tuple of 6 elements:
      (1) path to the token-ids for Input training data-set,
      (2) path to the token-ids for Output training data-set,
      (3) path to the token-ids for Input development data-set,
      (4) path to the token-ids for Output development data-set,
      (5) path to the Input vocabulary file,
      (6) path to the Output vocabulary file.
  """
  # Get line_talk data to the specified directory.
  train_path = os.path.join(data_dir, "line_talk_train")               
  dev_path = os.path.join(data_dir, "line_talk_dev")                     
  
  # Create vocabularies of the appropriate sizes.
  out_vocab_path = os.path.join(data_dir, "vocab%d.out" % out_vocabulary_size ) 
  in_vocab_path = os.path.join(data_dir, "vocab%d.in"  % in_vocabulary_size ) 
  create_vocabulary(out_vocab_path, train_path + ".out", out_vocabulary_size, tokenizer)
  create_vocabulary(in_vocab_path, train_path + ".in", in_vocabulary_size, tokenizer)

  # Create token ids for the training data.
  out_train_ids_path = train_path + (".ids%d.out" % out_vocabulary_size)
  in_train_ids_path = train_path + (".ids%d.in" % in_vocabulary_size)
  data_to_token_ids(train_path + ".out", out_train_ids_path, out_vocab_path, tokenizer)
  data_to_token_ids(train_path + ".in", in_train_ids_path, in_vocab_path, tokenizer)

  # Create token ids for the development data.
  out_dev_ids_path = dev_path + (".ids%d.out" % out_vocabulary_size)
  in_dev_ids_path = dev_path + (".ids%d.in" % in_vocabulary_size)
  data_to_token_ids(dev_path + ".out", out_dev_ids_path, out_vocab_path, tokenizer)
  data_to_token_ids(dev_path + ".in", in_dev_ids_path, in_vocab_path, tokenizer)

  return (in_train_ids_path, out_train_ids_path,
          in_dev_ids_path, out_dev_ids_path,
          in_vocab_path, out_vocab_path)

[Source Code : chatbot.py]

# Copyright 2016 y-euda. All Rights Reserved.
# The following modifications are added based on tensorflow/models/rnn/translate/translate.py.
# - train() is modified to load LINE talk data text file.
# - decode() is modified for the input sentence in Japanese.
#
# ==============================================================================
# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

"""Binary for training seq2seq models and decoding from them.

Running this program without --decode reads the LINE talk data placed in
the directory specified as --data_dir, tokenizes it in a very basic way,
and then starts training a model, saving checkpoints to --train_dir.

Running with --decode starts an interactive loop so you can see how
the current checkpoint replies to Japanese input sentences.

See the following papers for more information on neural translation models.
 * http://arxiv.org/abs/1409.3215
 * http://arxiv.org/abs/1409.0473
 * http://arxiv.org/abs/1412.2007
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import math
import os
import random
import sys
import time
import logging

import numpy as np
import tensorflow as tf


import data_utils as data_utils
import seq2seq_model as seq2seq_model

from janome.tokenizer import Tokenizer

tf.app.flags.DEFINE_float("learning_rate", 0.5, "Learning rate.")
tf.app.flags.DEFINE_float("learning_rate_decay_factor", 0.99,
                          "Learning rate decays by this much.")
tf.app.flags.DEFINE_float("max_gradient_norm", 5.0,
                          "Clip gradients to this norm.")
tf.app.flags.DEFINE_integer("batch_size", 4, #64
                            "Batch size to use during training.")
tf.app.flags.DEFINE_integer("size", 256, "Size of each model layer.") #1024
tf.app.flags.DEFINE_integer("num_layers", 3, "Number of layers in the model.") #3
tf.app.flags.DEFINE_integer("en_vocab_size", 40000, "Input (.in) vocabulary size.") #40000
tf.app.flags.DEFINE_integer("fr_vocab_size", 40000, "Output (.out) vocabulary size.") #40000
tf.app.flags.DEFINE_string("data_dir", "line_talk_data", "Data directory")#data
tf.app.flags.DEFINE_string("train_dir", "line_talk_train", "Training directory.")#train
tf.app.flags.DEFINE_integer("max_train_data_size", 0,
                            "Limit on the size of training data (0: no limit).")
tf.app.flags.DEFINE_integer("steps_per_checkpoint", 100,#200
                            "How many training steps to do per checkpoint.")
tf.app.flags.DEFINE_boolean("decode", False,
                            "Set to True for interactive decoding.")
tf.app.flags.DEFINE_boolean("self_test", False,
                            "Run a self-test if this is set to True.")
tf.app.flags.DEFINE_boolean("use_fp16", False,
                            "Train using fp16 instead of fp32.")

FLAGS = tf.app.flags.FLAGS

# We use a number of buckets and pad to the closest one for efficiency.
# See seq2seq_model.Seq2SeqModel for details of how they work.
_buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]


def read_data(source_path, target_path, max_size=None):
  """Read data from source and target files and put into buckets.

  Args:
    source_path: path to the files with token-ids for the source language.
    target_path: path to the file with token-ids for the target language;
      it must be aligned with the source file: n-th line contains the desired
      output for n-th line from the source_path.
    max_size: maximum number of lines to read, all other will be ignored;
      if 0 or None, data files will be read completely (no limit).

  Returns:
    data_set: a list of length len(_buckets); data_set[n] contains a list of
      (source, target) pairs read from the provided data files that fit
      into the n-th bucket, i.e., such that len(source) < _buckets[n][0] and
      len(target) < _buckets[n][1]; source and target are lists of token-ids.
  """
  data_set = [[] for _ in _buckets]
  with tf.gfile.GFile(source_path, mode="r") as source_file:
    with tf.gfile.GFile(target_path, mode="r") as target_file:
      source, target = source_file.readline(), target_file.readline()
      counter = 0
      while source and target and (not max_size or counter < max_size):
        counter += 1
        if counter % 100000 == 0:
          print("  reading data line %d" % counter)
          sys.stdout.flush()
        source_ids = [int(x) for x in source.split()]
        target_ids = [int(x) for x in target.split()]
        target_ids.append(data_utils.EOS_ID)
        for bucket_id, (source_size, target_size) in enumerate(_buckets):
          if len(source_ids) < source_size and len(target_ids) < target_size:
            data_set[bucket_id].append([source_ids, target_ids])
            break
        source, target = source_file.readline(), target_file.readline()
  return data_set


def create_model(session, forward_only):
  """Create chat model and initialize or load parameters in session."""
  dtype = tf.float16 if FLAGS.use_fp16 else tf.float32
  model = seq2seq_model.Seq2SeqModel(
      FLAGS.en_vocab_size,
      FLAGS.fr_vocab_size,
      _buckets,
      FLAGS.size,
      FLAGS.num_layers,
      FLAGS.max_gradient_norm,
      FLAGS.batch_size,
      FLAGS.learning_rate,
      FLAGS.learning_rate_decay_factor,
      forward_only=forward_only,
      dtype=dtype)
  ckpt = tf.train.get_checkpoint_state(FLAGS.train_dir)
  if ckpt and tf.gfile.Exists(ckpt.model_checkpoint_path):
    print("Reading model parameters from %s" % ckpt.model_checkpoint_path)
    model.saver.restore(session, ckpt.model_checkpoint_path)
  else:
    print("Created model with fresh parameters.")
    session.run(tf.initialize_all_variables())
  return model

def train():
  """Train an in->out chat model using LINE talk data."""
  # Prepare line talk data.
  print("Preparing LINE talk data in %s" % FLAGS.data_dir)
  in_train, out_train, in_dev, out_dev, _, _ = data_utils.prepare_line_talk_data(
      FLAGS.data_dir, FLAGS.en_vocab_size, FLAGS.fr_vocab_size)

  with tf.Session() as sess:
    # Create model.
    print("Creating %d layers of %d units." % (FLAGS.num_layers, FLAGS.size))
    model = create_model(sess, False)

    # Read data into buckets and compute their sizes.
    print ("Reading development and training data (limit: %d)."
           % FLAGS.max_train_data_size)
    dev_set = read_data(in_dev, out_dev)
    train_set = read_data(in_train, out_train, FLAGS.max_train_data_size)
    train_bucket_sizes = [len(train_set[b]) for b in xrange(len(_buckets))]
    train_total_size = float(sum(train_bucket_sizes))

    # A bucket scale is a list of increasing numbers from 0 to 1 that we'll use
    # to select a bucket. Length of [scale[i], scale[i+1]] is proportional to
    # the size if i-th training bucket, as used later.
    train_buckets_scale = [sum(train_bucket_sizes[:i + 1]) / train_total_size
                           for i in xrange(len(train_bucket_sizes))]

    # This is the training loop.
    step_time, loss = 0.0, 0.0
    current_step = 0
    previous_losses = []
    while True:
      # Choose a bucket according to data distribution. We pick a random number
      # in [0, 1] and use the corresponding interval in train_buckets_scale.
      random_number_01 = np.random.random_sample()
      bucket_id = min([i for i in xrange(len(train_buckets_scale))
                       if train_buckets_scale[i] > random_number_01])

      # Get a batch and make a step.
      start_time = time.time()
      encoder_inputs, decoder_inputs, target_weights = model.get_batch(
          train_set, bucket_id)
      _, step_loss, _ = model.step(sess, encoder_inputs, decoder_inputs,
                                   target_weights, bucket_id, False)
      step_time += (time.time() - start_time) / FLAGS.steps_per_checkpoint
      loss += step_loss / FLAGS.steps_per_checkpoint
      current_step += 1

      # Once in a while, we save checkpoint, print statistics, and run evals.
      if current_step % FLAGS.steps_per_checkpoint == 0:
        # Print statistics for the previous epoch.
        perplexity = math.exp(float(loss)) if loss < 300 else float("inf")
        print ("global step %d learning rate %.4f step-time %.2f perplexity "
               "%.2f" % (model.global_step.eval(), model.learning_rate.eval(),
                         step_time, perplexity))
        # Decrease learning rate if no improvement was seen over last 3 times.
        if len(previous_losses) > 2 and loss > max(previous_losses[-3:]):
          sess.run(model.learning_rate_decay_op)
        previous_losses.append(loss)
        # Save checkpoint and zero timer and loss.
        checkpoint_path = os.path.join(FLAGS.train_dir, "chatbot.ckpt")
        model.saver.save(sess, checkpoint_path, global_step=model.global_step)
        step_time, loss = 0.0, 0.0
        # Run evals on development set and print their perplexity.
        for bucket_id in xrange(len(_buckets)):
          if len(dev_set[bucket_id]) == 0:
            print("  eval: empty bucket %d" % (bucket_id))
            continue
          encoder_inputs, decoder_inputs, target_weights = model.get_batch(
              dev_set, bucket_id)
          _, eval_loss, _ = model.step(sess, encoder_inputs, decoder_inputs,
                                       target_weights, bucket_id, True)
          eval_ppx = math.exp(float(eval_loss)) if eval_loss < 300 else float(
              "inf")
          print("  eval: bucket %d perplexity %.2f" % (bucket_id, eval_ppx))
        sys.stdout.flush()

#--decode --data_dir line_talk_data --train_dir line_talk_data
def decode():
  with tf.Session() as sess:
    # Create model and load parameters.
    model = create_model(sess, True)
    model.batch_size = 1  # We decode one sentence at a time.

    # Load vocabularies.
    en_vocab_path = os.path.join(FLAGS.data_dir,
                                 "vocab%d.in" % FLAGS.en_vocab_size)
    fr_vocab_path = os.path.join(FLAGS.data_dir,
                                 "vocab%d.out" % FLAGS.fr_vocab_size)
    en_vocab, _ = data_utils.initialize_vocabulary(en_vocab_path)
    _, rev_fr_vocab = data_utils.initialize_vocabulary(fr_vocab_path)

    # Decode from standard input.
    sys.stdout.write("> ")
    sys.stdout.flush()
    sentence = sys.stdin.readline()
    t = Tokenizer()
    tokens = t.tokenize(sentence.decode('utf-8')) 
    sentence = ' '.join([token.surface for token in tokens]).encode('utf-8') 
    print('([Morpho]:'+ sentence +')')

    while sentence:
      # Get token-ids for the input sentence.
      token_ids = data_utils.sentence_to_token_ids(tf.compat.as_bytes(sentence), en_vocab)
      # Which bucket does it belong to?
      bucket_id = len(_buckets) - 1
      for i, bucket in enumerate(_buckets):
        if bucket[0] >= len(token_ids):
          bucket_id = i
          break
      else:
        logging.warning("Sentence truncated: %s", sentence) 

      # Get a 1-element batch to feed the sentence to the model.
      encoder_inputs, decoder_inputs, target_weights = model.get_batch(
          {bucket_id: [(token_ids, [])]}, bucket_id)
      # Get output logits for the sentence.
      _, _, output_logits = model.step(sess, encoder_inputs, decoder_inputs,
                                       target_weights, bucket_id, True)
      # This is a greedy decoder - outputs are just argmaxes of output_logits.
      outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]
      # If there is an EOS symbol in outputs, cut them at that point.
      if data_utils.EOS_ID in outputs:
        outputs = outputs[:outputs.index(data_utils.EOS_ID)]
      # Print out the response sentence corresponding to outputs.
      print(" ".join([tf.compat.as_str(rev_fr_vocab[output]) for output in outputs]))
      print("> ", end="")
      sys.stdout.flush()
      sentence = sys.stdin.readline()
      # Segment the next input line, reusing the tokenizer created before the loop.
      tokens = t.tokenize(sentence.decode('utf-8'))
      sentence = ' '.join([token.surface for token in tokens]).encode('utf-8')
      print('([Morpho]:' + sentence + ')')


def self_test():
  """Test the translation model."""
  with tf.Session() as sess:
    print("Self-test for neural translation model.")
    # Create model with vocabularies of 10, 2 small buckets, 2 layers of 32.
    model = seq2seq_model.Seq2SeqModel(10, 10, [(3, 3), (6, 6)], 32, 2,
                                       5.0, 32, 0.3, 0.99, num_samples=8)
    sess.run(tf.initialize_all_variables())

    # Fake data set for both the (3, 3) and (6, 6) bucket.
    data_set = ([([1, 1], [2, 2]), ([3, 3], [4]), ([5], [6])],
                [([1, 1, 1, 1, 1], [2, 2, 2, 2, 2]), ([3, 3, 3], [5, 6])])
    for _ in xrange(5):  # Train the fake model for 5 steps.
      bucket_id = random.choice([0, 1])
      encoder_inputs, decoder_inputs, target_weights = model.get_batch(
          data_set, bucket_id)
      model.step(sess, encoder_inputs, decoder_inputs, target_weights,
                 bucket_id, False)

def main(_):
  if FLAGS.self_test:
    self_test()
  elif FLAGS.decode:
    decode()
  else:
    train()

if __name__ == "__main__":
  tf.app.run()
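
The bucket-selection and EOS-trimming steps inside decode() can be isolated as plain functions. The following is only a minimal sketch: the bucket sizes are the ones used in the TensorFlow tutorial and EOS_ID = 2 mirrors data_utils.EOS_ID there. Both are assumptions, because _buckets is defined outside this excerpt, and the function names select_bucket and cut_at_eos are hypothetical.

```python
# Assumed values taken from the TensorFlow seq2seq tutorial, not from this post.
EXAMPLE_BUCKETS = [(5, 10), (10, 15), (20, 25), (40, 50)]
EOS_ID = 2


def select_bucket(buckets, num_tokens):
    """Return the index of the first bucket whose input size fits the sentence.

    Falls back to the last bucket when the sentence is longer than all of
    them (decode() logs a "Sentence truncated" warning in that case).
    """
    for i, (input_size, _) in enumerate(buckets):
        if input_size >= num_tokens:
            return i
    return len(buckets) - 1


def cut_at_eos(outputs, eos_id=EOS_ID):
    """Drop the EOS symbol and everything after it, if EOS is present."""
    if eos_id in outputs:
        return outputs[:outputs.index(eos_id)]
    return outputs
```

For example, a 7-token input falls into the second bucket (10, 15), because the first bucket only accepts up to 5 input tokens.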

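Building a Dialogue Bot with the Much-Talked-About TensorFlow and LINE Talk History (3)

What the development requires

To develop a text-based "dialogue bot", I use the "encoder-decoder sequence-to-sequence model" of TensorFlow, a library developed by Google that serves as the underlying technology of Google Translate, which has recently drawn attention for a dramatic improvement in accuracy. TensorFlow offers C++ and Python APIs, but as of this writing (2016/11/14) the Python API is the most actively developed and the easiest to use, so the rest of this series assumes development in Python.

Incidentally, TensorFlow is released as open-source software under the Apache 2.0 open source license, and besides this model it ships implementations of various DNN applications together with tutorials.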

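Reference 1: "Has Google Translate evolved!? Its improved accuracy is drawing attention (a new deep-learning-based translation system appears to have been introduced)", updated 2016-11-12 10:23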
http://nlab.itmedia.co.jp/nl/articles/1611/12/news021.html

Fig. Japanese translation of "Artificial Intelligence" produced by Google Translate

Reference 2: TensorFlow: Sequence-to-Sequence Models
https://www.tensorflow.org/versions/r0.11/tutorials/seq2seq/index.html#sequence-to-sequence-models

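The tutorial in Reference 2 (TensorFlow: Sequence-to-Sequence Models) presents an example of English-to-French translation using this model. The model takes a sequence of English words as input and outputs a sequence of French words. Applying this structure to a "dialogue bot" requires the following work.

1. Preparing the training data for the "dialogue bot"
2. Implementing the training logic of the "dialogue bot"
3. Implementing the dialogue logic of the "dialogue bot"

The details of each task are described next.

1. Preparing the training data for the "dialogue bot"

The training data for the tutorial's English-to-French translation is what is called a parallel corpus. Each English sentence is paired with the French sentence that translates it; the two languages are stored in separate files, and the same line number in each file forms one pair. In both English and French, the words are separated by half-width spaces, so that each line is a sequence of words.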
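
As a rough illustration of this layout, the sketch below reads two line-aligned files in lockstep and splits each line on whitespace. The function name read_parallel and the file paths passed to it are hypothetical, and this is not the tutorial's own loader.

```python
import io


def read_parallel(in_path, out_path):
    """Yield (source_tokens, target_tokens) for each pair of aligned lines."""
    with io.open(in_path, encoding="utf-8") as f_in, \
         io.open(out_path, encoding="utf-8") as f_out:
        # Line i of the first file corresponds to line i of the second file.
        for src, tgt in zip(f_in, f_out):
            yield src.split(), tgt.split()
```

Because zip stops at the shorter file, a mismatch in line counts silently drops the surplus lines, so it is worth checking that both files have the same number of lines beforehand.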

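The dialogue bot targeted here, by contrast, is built from LINE talk history data (in Japanese), which makes the following steps necessary.

・Creating pairs of Japanese sentences that form a dialogue

LINE talk data can easily be exported as text data with the following steps.

　(On iPhone) the talk screen to export → "Settings" (設定) → "Send talk history" (トーク履歴を送信)

The exported text data has the following format.

　File name: [LINE]<talk partner's name>.txt

ex1. [LINE]田中太郎.txt

　File contents: HH:MM\tspeaker\tutterance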

ex2.
08:40	太郎	おはようー
08:40	太郎	ございます！
08:41	太郎	[スタンプ]
09:00	花子	今日は起きれたんだね。
09:02	太郎	[スタンプ]
　：　：　：

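This file is used to create pairs of input and output sentences.

A characteristic of LINE talk is that one person often sends several short messages in a row, so I treat a run of consecutive messages by the same speaker as one unit of the sequence. For simplicity, the time gap between messages is ignored, and everything a speaker sends consecutively is merged into a single unit.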

ex3.
太郎	おはようーございます！[スタンプ]
花子	今日は起きれたんだね。
太郎	[スタンプ]
　：　：

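As for the ordering of utterances, one should properly take the message times into account, identify sessions, and generate only pairs that really form an exchange. Here, however, after merging each run of messages into one unit as described above, I simply assign the first speaker to the input side and the next speaker to the output side, and generate pairs in order.

ex4. Input side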
おはようーございます！[スタンプ]
[スタンプ]
　：　

ex5. Output side
今日は起きれたんだね。
　：　
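
The merging and pairing described above can be sketched as follows. This is a minimal sketch under assumptions, not the code used for this post: the function names are hypothetical, lines that do not match HH:MM\tspeaker\tutterance (date headers, blank lines, continuation lines of multi-line messages) are simply skipped, and a trailing input with no reply is dropped so that both sides stay line-aligned.

```python
# -*- coding: utf-8 -*-


def parse_line(raw):
    """Split one exported line into (time, speaker, text); None if malformed."""
    parts = raw.rstrip(u"\n").split(u"\t", 2)
    return tuple(parts) if len(parts) == 3 else None


def merge_turns(records):
    """Concatenate consecutive utterances by the same speaker into one turn."""
    turns = []
    for _, speaker, text in records:
        if turns and turns[-1][0] == speaker:
            turns[-1] = (speaker, turns[-1][1] + text)
        else:
            turns.append((speaker, text))
    return turns


def make_pairs(turns):
    """Assign turns alternately to the input side and the output side."""
    texts = [text for _, text in turns]
    inputs, outputs = texts[0::2], texts[1::2]
    n = min(len(inputs), len(outputs))  # assumption: drop an unanswered last input
    return inputs[:n], outputs[:n]


raw_lines = [
    u"08:40\t太郎\tおはようー\n",
    u"08:40\t太郎\tございます！\n",
    u"08:41\t太郎\t[スタンプ]\n",
    u"09:00\t花子\t今日は起きれたんだね。\n",
    u"09:02\t太郎\t[スタンプ]\n",
]
records = [r for r in map(parse_line, raw_lines) if r is not None]
inputs, outputs = make_pairs(merge_turns(records))
```

With the five lines of ex2 as input, the first input is the merged greeting and the first output is the reply; the final [スタンプ] turn has no reply yet in this short excerpt and is therefore left out.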

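・Word segmentation (wakachi-gaki) of Japanese sentences

In languages such as Japanese and Chinese, sentences contain no delimiter between words. To represent such a sentence as an explicit sequence, morphological analysis is needed to split it into meaningful units (morphemes).

Here I use Janome, a morphological analysis engine written in pure Python.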

Janome (0.2.8)
http://mocobeta.github.io/janome/

Sample code is shown below.

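# -*- coding: utf-8 -*-
import io
from janome.tokenizer import Tokenizer

# `inputs` is assumed to be the list of (unicode) input-side utterances from ex4.
inputs = [u'おはようーございます！[スタンプ]', u'[スタンプ]']

t = Tokenizer()
# io.open writes UTF-8 text on both Python 2 and Python 3.
with io.open('../data/line_talk.in', mode='w', encoding='utf-8') as fw:
    for line in inputs:
        tokens = t.tokenize(line)
        # Join the surface forms with half-width spaces, one utterance per line.
        fw.write(u' '.join(token.surface for token in tokens) + u'\n')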

The word-segmented input side and output side are each stored in their own file.

ex6. line_talk.in
おはようーございます！ [スタンプ]
[スタンプ]
　：　

ex7. line_talk.out
今日は起きれたんだね。
　：　

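Building a Dialogue Bot with the Much-Talked-About TensorFlow and LINE Talk History (1)

Building a Dialogue Bot with the Much-Talked-About TensorFlow and LINE Talk History (2)

Building a Dialogue Bot with the Much-Talked-About TensorFlow and LINE Talk History (4)

Building a Dialogue Bot with the Much-Talked-About TensorFlow and LINE Talk History (5)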
