*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
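The message above is displayed when the book materials are loaded with 'from nltk.book import *'. To examine how a word is used in context, you can search for the passages where it appears. For example, to see how the word 'affection' is used in "Sense and Sensibility" by Jane Austen (1811), call the 'concordance' method as follows.
text2.concordance("affection")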
Displaying 25 of 25 matches:
, however , and , as a mark of his affection for the three girls , he left them
t . It was very well known that no affection was ever supposed to exist between
deration of politeness or maternal affection on the side of the former , the tw
d the suspicion -- the hope of his affection for me may warrant , without impru
hich forbade the indulgence of his affection . She knew that his mother neither
rd she gave one with still greater affection . Though her late conversation wit
can never hope to feel or inspire affection again , and if her home be uncomfo
m of the sense , elegance , mutual affection , and domestic comfort of the fami
, and which recommended him to her affection beyond every thing else . His soci
ween the parties might forward the affection of Mr . Willoughby , an equally st
the most pointed assurance of her affection . Elinor could not be surprised at
he natural consequence of a strong affection in a young and ardent mind . This
opinion . But by an appeal to her affection for her mother , by representing t
every alteration of a place which affection had established as perfect with hi
e will always have one claim of my affection , which no other can possibly shar
f the evening declared at once his affection and happiness . " Shall we see you
ause he took leave of us with less affection than his usual behaviour has shewn
ness ." " I want no proof of their affection ," said Elinor ; " but of their en
onths , without telling her of his affection ;-- that they should part without
ould be the natural result of your affection for her . She used to be all unres
distinguished Elinor by no mark of affection . Marianne saw and listened with i
th no inclination for expense , no affection for strangers , no profession , an
till distinguished her by the same affection which once she had felt no doubt o
al of her confidence in Edward ' s affection , to the remembrance of every mark
was made ? Had he never owned his affection to yourself ?" " Oh , no ; but if
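The idea behind a concordance view can be sketched in plain Python. This is only an illustration of the keyword-in-context layout, not NLTK's implementation; the function name, the 'width' parameter, and the sample tokens are assumptions made for this example.

```python
def concordance(tokens, word, width=30):
    """Return one 'left  word  right' line per case-insensitive match of `word`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            # Keep only the last `width` characters to the left of the match
            left = " ".join(tokens[:i])[-width:]
            # ...and the first `width` characters to the right of it
            right = " ".join(tokens[i + 1:])[:width]
            lines.append(f"{left:>{width}} {tok} {right}")
    return lines

# Hypothetical token list, made up for this sketch
tokens = "he left them his affection , and no affection was ever supposed".split()
for line in concordance(tokens, "affection"):
    print(line)
```

Each printed line aligns the matched word in the same column, which is what makes the surrounding context easy to scan.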
To find the length of a text from start to finish, in terms of words and punctuation symbols, type:
len(text3)
Genesis (text3) consists of 44,764 words and punctuation symbols. These units are also called "tokens". The vocabulary of a text is the set of its distinct tokens. To display that set, type:
sorted(set(text3))
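The difference between the token count and the vocabulary can be seen on a tiny example. The token list below is a hand-made stand-in for an NLTK text, chosen only for illustration.

```python
# A short, hand-made token list standing in for a text such as text3
tokens = ["In", "the", "beginning", "God", "created",
          "the", "heaven", "and", "the", "earth", "."]

# Length of the text: every word and punctuation symbol is counted
n_tokens = len(tokens)

# Vocabulary: the distinct tokens, sorted for display
vocab = sorted(set(tokens))

print(n_tokens)
print(vocab)
```

Repeated tokens such as 'the' count every time toward the length but only once toward the vocabulary.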
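import nltk
# Wrap the word list of Emma in an nltk.Text to get methods such as concordance()
emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
from nltk.corpus import gutenberg
gutenberg.fileids()  # list the files in the Gutenberg corpus; this prints:
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...]
emma = gutenberg.words('austen-emma.txt')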
Texts loaded through nltk.corpus provide methods such as words(), raw(), and sents().
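Project Gutenberg has collected more than 50,000 free electronic texts, which can be downloaded from the Project Gutenberg site. If the downloaded file is '2554-0.txt', it can be processed with commands like the ones below. 2554-0.txt is 'Crime and Punishment' by Fyodor Dostoevsky. raw is a single string (str), so it still contains characters such as '\n' and '\r'. To build a sequence of tokens consisting only of words and punctuation, use word_tokenize(). A detailed manual is available on the NLTK site.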
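import nltk
from nltk import word_tokenize
path = "./corpora/2554-0.txt"
# Read the whole file and decode it into one UTF-8 string
with open(path, "rb") as f:
    raw = f.read().decode("utf-8")
# Split the raw string into word and punctuation tokens
tokens = word_tokenize(raw)
text = nltk.Text(tokens)
# The vocabulary is the sorted set of distinct tokens
vocab = sorted(set(tokens))
print("Size of text: ", len(text))
print("Vocabulary size:", len(vocab))
# Show every occurrence of 'affection' with its context
text.concordance("affection")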
Displaying 2 of 2 matches:
beziatnikov , who felt a return of affection for Pyotr Petrovitch . “ And , wha
sure that it was only your special affection for my poor husband that has made
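To see why tokenization is needed on a raw string, here is a minimal sketch that uses a simple regular expression as a stand-in for word_tokenize(). The regex is an assumption made for this example and is far cruder than NLTK's tokenizer; the sample string is made up.

```python
import re

# A made-up raw string with Windows-style line endings, as found in many text files
raw = "First line,\r\nsecond line.\r\n\r\nThird line!"

# Runs of word characters become one token; each other non-space symbol is its own token
tokens = re.findall(r"\w+|[^\w\s]", raw)

print(tokens)
# ['First', 'line', ',', 'second', 'line', '.', 'Third', 'line', '!']
```

The '\r' and '\n' characters are whitespace, so they separate tokens but never appear in the result.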
When using corpus.reader, the file can be loaded as follows:
import nltk
from nltk.text import Text
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
corpus = PlaintextCorpusReader('./corpora/', '2554-0.txt')
crime = Text(w for w in corpus.words())
import nltk
import re
raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
well without--Maybe it's always pepper that makes people hot-tempered,'..."""
re.split(r' ', raw)
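Splitting on a single space leaves newlines inside tokens and produces empty strings where two spaces meet. The short sketch below, using a made-up string, compares that with splitting on any run of whitespace.

```python
import re

# Made-up sample with a newline and a double space
raw = "no pepper\nin my  kitchen"

# Splitting on one space: the newline stays inside a token, the double space yields ''
by_space = re.split(r' ', raw)
print(by_space)       # ['no', 'pepper\nin', 'my', '', 'kitchen']

# Splitting on runs of any whitespace gives clean tokens
by_whitespace = re.split(r'\s+', raw)
print(by_whitespace)  # ['no', 'pepper', 'in', 'my', 'kitchen']
```

For this input, raw.split() with no arguments gives the same result as the second call.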
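Next we look at opinion mining with Keras and TensorFlow. The data is the IMDB (Internet Movie DataBase) dataset, a collection of 50,000 movie reviews split into 25,000 for training and 25,000 for testing. In both the training and the test set, positive and negative reviews each make up 50% of the data. Every review in the IMDB dataset carries a positive/negative label that serves as the supervision signal.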
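import numpy as np
import tensorflow as tf
from tensorflow import keras
# The IMDB dataset ships with Keras
imdb = keras.datasets.imdb
# Keep only the 10,000 most frequent words of the training data
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)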
print("Training entries: {}, labels: {}".format(len(train_data), len(train_labels)))
Training entries: 25000, labels: 25000
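The effect of limiting the vocabulary, as num_words=10000 does in the call above, can be illustrated on a toy corpus. This is a simplified sketch on word strings: Keras works on integer indices and its exact cutoff differs because some indices are reserved. The helper name keep_top_words and the sample data are hypothetical.

```python
from collections import Counter

def keep_top_words(docs, top_n, unk="<UNK>"):
    """Keep the `top_n` most frequent words; replace every other word with `unk`."""
    counts = Counter(w for doc in docs for w in doc)
    kept = {w for w, _ in counts.most_common(top_n)}
    return [[w if w in kept else unk for w in doc] for doc in docs]

# Made-up tokenized reviews
docs = [
    ["a", "great", "great", "film"],
    ["a", "dull", "film"],
    ["a", "great", "cast"],
]

limited = keep_top_words(docs, top_n=3)
print(limited)
# [['a', 'great', 'great', 'film'], ['a', '<UNK>', 'film'], ['a', 'great', '<UNK>']]
```

The rare words 'dull' and 'cast' are replaced, while the shape of each review is preserved.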
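Using a dictionary (dict), the reviews, which are represented as lists of integer indices, are decoded back into strings. We display the text of the first review. The argument num_words=10000 keeps the 10,000 most frequent words in the training data. Rare words are discarded to keep the size of the data manageable.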
# A dictionary mapping words to an integer index
word_index = imdb.get_word_index()
# The first indices are reserved
word_index = {k: (v + 3) for k, v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3
# Invert the mapping: integer index -> word
reverse_word_index = {value: key for key, value in word_index.items()}
def decode_review(text):
    # Join the words for each index; indices with no entry become '?'
    return ' '.join(reverse_word_index.get(i, '?') for i in text)
# Display the text of the first training review
print(decode_review(train_data[0]))
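The decoding step can be tried without downloading the dataset. The sketch below replaces imdb.get_word_index() with a three-word toy dictionary, which is an assumption made for illustration; the reserved token names follow the usual convention for this dataset.

```python
# Toy stand-in for imdb.get_word_index(): word -> frequency rank (1 = most frequent)
toy_word_index = {"the": 1, "film": 2, "great": 3}

# Shift every index by 3 to make room for the reserved entries
word_index = {w: rank + 3 for w, rank in toy_word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

# Invert the mapping: integer index -> word
reverse_word_index = {idx: w for w, idx in word_index.items()}

def decode_review(encoded):
    # Indices that have no entry are shown as '?'
    return " ".join(reverse_word_index.get(i, "?") for i in encoded)

encoded = [1, 4, 5, 2, 6, 0, 0]
print(decode_review(encoded))
# <START> the film <UNK> great <PAD> <PAD>
```

Because of the shift by 3, rank 1 ('the') is stored as index 4, which is why the encoded review never uses the raw ranks directly.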
