Lab Report No. 4

Lee Tae Ho, Stepan Sinyavsky, group A-02-22

Team 3

Task 1

1) We created a new notebook in Google Colab, imported the libraries and modules needed for the work, and configured the notebook to use a GPU hardware accelerator.

# import modules
import os
os.chdir('/content/drive/MyDrive/Colab Notebooks/is_lab4')

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
import matplotlib.pyplot as plt
import numpy as np

import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0

2) We loaded the IMDb dataset, which contains digitized movie reviews labeled with two classes: positive and negative. When loading the dataset, we set the seed parameter to (4k – 1) = 11, where k = 3 is the team number. We printed the shapes of the resulting training and test arrays.

# load the dataset
from keras.datasets import imdb

vocabulary_size = 5000
index_from = 3

(X_train, y_train), (X_test, y_test) = imdb.load_data(
    path="imdb.npz",
    num_words=vocabulary_size,
    skip_top=0,
    maxlen=None,
    seed=11,
    start_char=1,
    oov_char=2,
    index_from=index_from
    )

# print the shapes
print('Shape of X train:', X_train.shape)
print('Shape of y train:', y_train.shape)
print('Shape of X test:', X_test.shape)
print('Shape of y test:', y_test.shape)

Shape of X train: (25000,)
Shape of y train: (25000,)
Shape of X test: (25000,)
Shape of y test: (25000,)

3) We printed one review from the training set as a list of word indices, converted the list of indices to text, and printed the review as text. We printed the length of the review, as well as its class label and class name (1 – Positive, 0 – Negative).

# build a dictionary for converting indices to words
# load the "word: index" dictionary
word_to_id = imdb.get_word_index()
# shift the indices by index_from and add the special tokens
word_to_id = {key:(value + index_from) for key,value in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
word_to_id["<UNUSED>"] = 3
# build the reverse "index: word" dictionary
id_to_word = {value:key for key,value in word_to_id.items()}

print(X_train[26])
print('len:',len(X_train[26]))

[1, 2489, 723, 2, 9, 399, 2301, 11, 551, 2, 29, 47, 1391, 6, 1692, 15, 29, 70, 361, 8, 97, 35, 3258, 40, 6, 2, 106, 42, 2, 4298, 64, 8, 28, 15, 3258, 796, 2, 11, 6, 275, 1622, 21, 50, 26, 148, 33, 27, 2301, 2, 15, 81, 24, 40, 42, 2, 7, 27, 4646, 5, 80, 81, 845, 12, 304, 8, 67, 15, 29, 152, 3115, 103, 6, 1196, 2, 15, 238, 28, 1894, 27, 2, 2489, 2, 1068, 8, 2181, 27, 1692, 23, 309, 17, 873, 183, 140, 2357, 355, 5, 29, 9, 2, 83, 6, 2699, 1765, 2, 625, 2691, 1229, 80, 516, 10, 10, 11, 2, 279, 12, 286, 141, 6, 52, 326, 8, 796, 106, 4, 2, 132, 11, 4, 172, 1269, 13, 296, 4, 2223, 994, 7, 4, 2223, 5, 3176, 7, 4, 2223, 50, 186, 8, 30, 64, 38, 111, 102, 44, 551, 2, 5, 4, 4616, 3388, 302, 12, 70, 28, 23, 4, 406, 648, 15, 31, 415, 144, 30, 93, 8, 4325, 11, 6, 289, 42, 689, 251, 810, 146, 24, 252, 51, 148, 1893, 18, 4, 20, 1029, 17, 68, 2436, 819, 18, 4, 2, 132, 21, 76, 7, 12, 9, 38, 729, 8, 4, 2223, 102, 15, 12, 566, 30, 2691, 2, 190, 4, 2, 132, 218, 60, 754, 17, 52, 17, 4, 249, 7, 4, 2223, 2355, 10, 10, 1371, 112, 1905, 4981, 4, 2, 132, 47, 450, 85, 712, 15, 66, 1487, 4, 3129, 7, 4, 20, 6, 194, 1834, 13, 28, 9, 19, 2, 2, 11, 4, 485, 240, 141, 6, 2, 1995, 15, 24, 64, 81, 13, 24, 459, 44, 27, 2073, 13, 165, 3663, 18, 12, 696, 177, 1066, 1083, 2, 5, 2, 1602, 26, 220, 17, 78, 507, 38, 1904, 5, 753, 36, 983, 551, 11, 192, 225, 55, 117, 8, 79, 2229, 44, 137, 149, 4, 2, 132, 4, 816, 475, 24, 55, 906, 4, 168, 475, 13, 62, 1634, 76, 7, 12, 17, 2, 4, 114, 475, 727, 4, 206, 475, 50, 218, 101, 444, 14, 9, 31, 8, 798, 10, 10, 2994, 13, 296, 4, 2, 132, 2864, 6, 1039, 7, 4, 736, 1067, 750, 2, 390, 163, 538, 137, 24, 35, 1557, 55, 400, 4, 2, 4, 20, 475, 4, 128, 4, 3179, 2, 4, 493, 569, 220, 32, 7, 68, 3734, 19, 4, 2, 132, 637, 202, 12, 6, 55, 2, 470, 457, 23, 61, 3179, 675, 2407]
len: 413

review_as_text = ' '.join(id_to_word[idx] for idx in X_train[26])
print(review_as_text)
# length of the decoded text in characters (not in tokens)
print('len:',len(review_as_text))

<START> professor paul <UNK> is doing research in matter <UNK> he has developed a machine that he can use to make an object like a <UNK> watch or <UNK> disappear only to have that object re <UNK> in a different location but there are those at his research <UNK> that do not like or <UNK> of his experiments and will do whatever it takes to see that he doesn't succeed after a failed <UNK> that might have saved his <UNK> professor <UNK> decides to test his machine on himself as expected things go horribly wrong and he is <UNK> into a heavily scared <UNK> whose mere touch will kill br br in <UNK> maybe it wasn't such a good idea to re watch the <UNK> man in the same week i watched the fly return of the fly and curse of the fly there seems to be only so many movies about matter <UNK> and the potentially horrendous effects it can have on the human body that one person should be made to endure in a three or four day period i'm not sure what those responsible for the movie list as their source material for the <UNK> man but much of it is so similar to the fly movies that it cannot be mere <UNK> however the <UNK> man isn't even nearly as good as the worst of the fly trilogy br br besides being terribly unoriginal the <UNK> man has several other problems that really hurt the enjoyment of the movie a big issue i have is with <UNK> <UNK> in the lead he's such a <UNK> ass that not only do i not care about his suffering i actually root for it supporting cast members mary <UNK> and <UNK> allen are almost as bad they're so bland and dull they hardly matter in fact there's very little to get excited about while watching the <UNK> man the soundtrack  not very memorable the look  i would describe much of it as <UNK> the plot  predictable the action  there isn't any overall this is one to avoid br br fortunately i watched the <UNK> man via a copy of the mystery science theater <UNK> episode funny stuff while not an absolute very often the <UNK> the movie  the better the mst3k <UNK> the 
guys hit almost all of their marks with the <UNK> man i'll give it a very <UNK> 4 5 on my mst3k rating scale
len: 2113
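
Step 3 also calls for the class label and class name of the review. A minimal sketch of how these could be printed is given below. It uses a tiny stand-in dictionary and a hypothetical label value so that it runs on its own; the helper names `decode_review` and `class_name` are illustrative and not part of the lab code. In the notebook, the same helpers would be called with `X_train[26]`, `y_train[26]`, and the real `id_to_word`.

```python
# Toy stand-in for the real id_to_word dictionary (assumption for illustration only).
toy_id_to_word = {0: "<PAD>", 1: "<START>", 2: "<UNK>", 3: "<UNUSED>",
                  4: "the", 5: "and", 6: "a"}

CLASS_NAMES = {0: "Negative", 1: "Positive"}

def decode_review(ids, id_to_word):
    """Turn a sequence of word indices into text; missing indices map to <UNK>."""
    return " ".join(id_to_word.get(int(i), "<UNK>") for i in ids)

def class_name(label):
    """Map a binary label to its class name (1 - Positive, 0 - Negative)."""
    return CLASS_NAMES[int(label)]

toy_review = [1, 4, 2, 5, 6]  # hypothetical review, not a real IMDb sample
toy_label = 0                 # hypothetical label, not the real y_train[26]

print(decode_review(toy_review, toy_id_to_word))
print("label:", toy_label, "class:", class_name(toy_label))
```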

4) We printed the maximum and minimum review lengths in the training set.

print('MAX Len: ',len(max(X_train, key=len)))
print('MIN Len: ',len(min(X_train, key=len)))

MAX Len:  2494
MIN Len:  11
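
With lengths ranging from 11 to 2494 tokens, one way to choose a common length is to check what share of reviews fits within a candidate value without truncation. The sketch below illustrates the idea on hypothetical lengths; the `coverage` helper and the toy list are assumptions, and in the notebook the list would be `[len(x) for x in X_train]` computed before padding.

```python
def coverage(lengths, max_len):
    """Fraction of sequences whose length does not exceed max_len."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)

# Hypothetical review lengths, chosen only to make the example runnable.
toy_lengths = [11, 120, 230, 413, 480, 650, 900, 2494]

for candidate in (250, 500, 1000):
    print(candidate, coverage(toy_lengths, candidate))
```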

5) We preprocessed the data. We chose a single length to which all reviews are brought: short reviews are padded with a special token, and long reviews are truncated to the chosen length.

# data preprocessing
from tensorflow.keras.utils import pad_sequences
max_words = 500
X_train = pad_sequences(X_train, maxlen=max_words, value=0, padding='pre', truncating='post')
X_test = pad_sequences(X_test, maxlen=max_words, value=0, padding='pre', truncating='post')
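
To make the effect of `padding='pre'` and `truncating='post'` concrete, here is a pure-Python sketch of the same behaviour for a single sequence. The function `pad_pre_truncate_post` is a hypothetical helper written for illustration; it is not how `pad_sequences` is implemented.

```python
def pad_pre_truncate_post(seq, maxlen, value=0):
    """Mimic padding='pre', truncating='post' for one sequence."""
    seq = list(seq)[:maxlen]                     # 'post' truncation drops the tail
    return [value] * (maxlen - len(seq)) + seq   # 'pre' padding fills the front

short = pad_pre_truncate_post([1, 14, 22], maxlen=5)
long = pad_pre_truncate_post([1, 14, 22, 16, 43, 530, 973], maxlen=5)
print(short)  # [0, 0, 1, 14, 22]
print(long)   # [1, 14, 22, 16, 43]
```

With these settings a long review keeps its beginning and loses its ending, while a short review gets zeros in front of its first token.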

6) We repeated step 4.

print('MAX Len: ',len(max(X_train, key=len)))
print('MIN Len: ',len(min(X_train, key=len)))

MAX Len:  500
MIN Len:  500

7) We repeated step 3 and drew a conclusion about how the review changed after preprocessing.

print(X_train[26])
print('len:',len(X_train[26]))

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    1 2489  723    2    9  399 2301   11  551    2   29
   47 1391    6 1692   15   29   70  361    8   97   35 3258   40    6
    2  106   42    2 4298   64    8   28   15 3258  796    2   11    6
  275 1622   21   50   26  148   33   27 2301    2   15   81   24   40
   42    2    7   27 4646    5   80   81  845   12  304    8   67   15
   29  152 3115  103    6 1196    2   15  238   28 1894   27    2 2489
    2 1068    8 2181   27 1692   23  309   17  873  183  140 2357  355
    5   29    9    2   83    6 2699 1765    2  625 2691 1229   80  516
   10   10   11    2  279   12  286  141    6   52  326    8  796  106
    4    2  132   11    4  172 1269   13  296    4 2223  994    7    4
 2223    5 3176    7    4 2223   50  186    8   30   64   38  111  102
   44  551    2    5    4 4616 3388  302   12   70   28   23    4  406
  648   15   31  415  144   30   93    8 4325   11    6  289   42  689
  251  810  146   24  252   51  148 1893   18    4   20 1029   17   68
 2436  819   18    4    2  132   21   76    7   12    9   38  729    8
    4 2223  102   15   12  566   30 2691    2  190    4    2  132  218
   60  754   17   52   17    4  249    7    4 2223 2355   10   10 1371
  112 1905 4981    4    2  132   47  450   85  712   15   66 1487    4
 3129    7    4   20    6  194 1834   13   28    9   19    2    2   11
    4  485  240  141    6    2 1995   15   24   64   81   13   24  459
   44   27 2073   13  165 3663   18   12  696  177 1066 1083    2    5
    2 1602   26  220   17   78  507   38 1904    5  753   36  983  551
   11  192  225   55  117    8   79 2229   44  137  149    4    2  132
    4  816  475   24   55  906    4  168  475   13   62 1634   76    7
   12   17    2    4  114  475  727    4  206  475   50  218  101  444
   14    9   31    8  798   10   10 2994   13  296    4    2  132 2864
    6 1039    7    4  736 1067  750    2  390  163  538  137   24   35
 1557   55  400    4    2    4   20  475    4  128    4 3179    2    4
  493  569  220   32    7   68 3734   19    4    2  132  637  202   12
    6   55    2  470  457   23   61 3179  675 2407]
len: 500

review_as_text = ' '.join(id_to_word[idx] for idx in X_train[26])
print(review_as_text)
# length of the decoded text in characters (not in tokens)
print('len:',len(review_as_text))

<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <START> professor paul <UNK> is doing research in matter <UNK> he has developed a machine that he can use to make an object like a <UNK> watch or <UNK> disappear only to have that object re <UNK> in a different location but there are those at his research <UNK> that do not like or <UNK> of his experiments and will do whatever it takes to see that he doesn't succeed after a failed <UNK> that might have saved his <UNK> professor <UNK> decides to test his machine on himself as expected things go horribly wrong and he is <UNK> into a heavily scared <UNK> whose mere touch will kill br br in <UNK> maybe it wasn't such a good idea to re watch the <UNK> man in the same week i watched the fly return of the fly and curse of the fly there seems to be only so many movies about matter <UNK> and the potentially horrendous effects it can have on the human body that one person should be made to endure in a three or four day period i'm not sure what those responsible for the movie list as their source material for the <UNK> man but much of it is so similar to the fly movies that it cannot be mere <UNK> however the <UNK> man isn't even nearly as good as the worst of the fly trilogy br br besides being terribly unoriginal the <UNK> man has several other problems that really hurt the enjoyment of the movie a big issue i have is with <UNK> <UNK> in the lead he's such a <UNK> ass that not only do i not care about his suffering i actually root for it supporting cast members 
mary <UNK> and <UNK> allen are almost as bad they're so bland and dull they hardly matter in fact there's very little to get excited about while watching the <UNK> man the soundtrack  not very memorable the look  i would describe much of it as <UNK> the plot  predictable the action  there isn't any overall this is one to avoid br br fortunately i watched the <UNK> man via a copy of the mystery science theater <UNK> episode funny stuff while not an absolute very often the <UNK> the movie  the better the mst3k <UNK> the guys hit almost all of their marks with the <UNK> man i'll give it a very <UNK> 4 5 on my mst3k rating scale
len: 2635

After preprocessing, 87 <PAD> tokens (index 0) were prepended to the review (500 - 413 = 87), bringing it to a length of 500 indices. The review text itself is unchanged.
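
The numbers printed above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the reported values (413 tokens before padding, text lengths of 2113 and 2635 characters); the variable names are chosen for illustration.

```python
original_len = 413   # tokens in review 26 before padding
max_words = 500      # common length chosen in step 5
pad_count = max_words - original_len

chars_before = 2113  # length of the decoded text before padding
chars_after = 2635   # length of the decoded text after padding

# Each prepended token contributes "<PAD>" plus one separating space.
extra_chars = pad_count * len("<PAD> ")

print("pad tokens:", pad_count)
print("extra characters:", extra_chars)
print("consistent:", chars_after - chars_before == extra_chars)
```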

8) We printed the preprocessed training and test arrays and their shapes.

# print the data
print('X train: \n',X_train)
print('X test: \n',X_test)

# print the shapes
print('Shape of X train:', X_train.shape)
print('Shape of X test:', X_test.shape)

X train: 
 [[   0    0    0 ...    6    2    2]
 [   0    0    0 ...   10   10    2]
 [   1   14   22 ...  171  153  303]
 ...
 [   0    0    0 ...   17 2199 1262]
 [   0    0    0 ...  606    5 1356]
 [   0    0    0 ... 1026    5  804]]
X test: 
 [[   0    0    0 ...   10   10    2]
 [   0    0    0 ...   43 1044  710]
 [   0    0    0 ...   35  744   23]
 ...
 [   0    0    0 ...  184 1543  616]
 [   0    0    0 ...   38    2   78]
 [   0    0    0 ...    5    2    2]]
Shape of X train: (25000, 500)
Shape of X test: (25000, 500)

9) We implemented a recurrent neural network model consisting of Embedding, LSTM, Dropout, and Dense layers and trained it on the training data, holding out part of the training data for validation. We printed the network architecture and reached an accuracy of at least 0.8.

embed_dim = 32
lstm_units = 64

model = Sequential()
# explicit Input layer: the input_length/input_shape arguments of Embedding are deprecated in Keras 3
model.add(layers.Input(shape=(max_words,)))
model.add(layers.Embedding(input_dim=vocabulary_size, output_dim=embed_dim))
model.add(layers.LSTM(lstm_units))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))

model.summary()

Model: "sequential"

Layer (type)	Output Shape	Param #
embedding_4 (Embedding)	(None, 500, 32)	160,000
lstm_4 (LSTM)	(None, 64)	24,832
dropout_4 (Dropout)	(None, 64)	0
dense_4 (Dense)	(None, 1)	65

Total params: 184,897 (722.25 KB)
Trainable params: 184,897 (722.25 KB)
Non-trainable params: 0 (0.00 B)
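
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights, and a bias), and a dense layer; the variable names are illustrative, and the size estimate assumes 4-byte float32 weights.

```python
vocabulary_size, embed_dim, lstm_units = 5000, 32, 64

# Embedding: one embed_dim-sized vector per vocabulary entry.
embedding_params = vocabulary_size * embed_dim

# LSTM: 4 gates, each with (input + recurrent) weights and a bias vector.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dense(1): one weight per LSTM unit plus a bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
size_kb = total_params * 4 / 1024  # assuming float32 weights

print(embedding_params, lstm_params, dense_params, total_params)
print(round(size_kb, 2), "KB")
```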

# compile and train the model
batch_size = 64
epochs = 3
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.2)

Epoch 1/3
313/313 ━━━━━━━━━━━━━━━━━━━━ 13s 23ms/step - accuracy: 0.6613 - loss: 0.5831 - val_accuracy: 0.8470 - val_loss: 0.3631
Epoch 2/3
313/313 ━━━━━━━━━━━━━━━━━━━━ 18s 25ms/step - accuracy: 0.8749 - loss: 0.3133 - val_accuracy: 0.7728 - val_loss: 0.5550
Epoch 3/3
313/313 ━━━━━━━━━━━━━━━━━━━━ 7s 21ms/step - accuracy: 0.8655 - loss: 0.3285 - val_accuracy: 0.8696 - val_loss: 0.3508
<keras.src.callbacks.history.History at 0x7a8f3a94e2a0>
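
The step counts in the logs follow from the dataset size, the validation split, and the batch size. The sketch below reproduces the 313 training steps per epoch and, assuming `model.evaluate` uses its default batch size of 32, the 782 evaluation steps; the variable names are illustrative.

```python
import math

n_total = 25000                        # training reviews
n_val = n_total * 20 // 100            # validation_split=0.2
n_train = n_total - n_val

train_steps = math.ceil(n_train / 64)  # batch_size=64 in model.fit
eval_steps = math.ceil(25000 / 32)     # assumed default batch size of model.evaluate

print("training samples:", n_train)
print("steps per epoch:", train_steps)
print("evaluation steps:", eval_steps)
```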

test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"\nTest accuracy: {test_acc}")

782/782 ━━━━━━━━━━━━━━━━━━━━ 9s 11ms/step - accuracy: 0.8611 - loss: 0.3604

Test accuracy: 0.8602399826049805

10) We evaluated the quality of the trained model on the test data:

- printed the classification quality metric on the test data

- printed the classification report for the test set

- plotted the ROC curve from the predictions on the test set and computed the area under the ROC curve (AUC ROC)

# classification quality metric on the test data
print(f"\nTest accuracy: {test_acc}")

Test accuracy: 0.8602399826049805

# classification report for the test set
y_score = model.predict(X_test)
y_pred = [1 if y_score[i,0]>=0.5 else 0 for i in range(len(y_score))]

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, labels = [0, 1], target_names=['Negative', 'Positive']))

              precision    recall  f1-score   support

    Negative       0.82      0.92      0.87     12500
    Positive       0.91      0.80      0.85     12500

    accuracy                           0.86     25000
   macro avg       0.86      0.86      0.86     25000
weighted avg       0.86      0.86      0.86     25000
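
The f1-score column is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes it from the rounded precision and recall values in the report; because the inputs are already rounded to two decimals, agreement with the reported f1-scores is approximate in general. The `f1` helper and the `report` dictionary are written for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the report above.
report = {"Negative": (0.82, 0.92), "Positive": (0.91, 0.80)}

for name, (p, r) in report.items():
    print(name, round(f1(p, r), 2))
```

The report also shows an asymmetry: recall is 0.92 for Negative and 0.80 for Positive, so at the 0.5 threshold the model misses positive reviews more often than negative ones.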

# plot the ROC curve and compute AUC ROC
from sklearn.metrics import roc_curve, auc

fpr, tpr, thresholds = roc_curve(y_test, y_score)
plt.plot(fpr, tpr)
plt.grid()
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.show()
print('AUC ROC:', auc(fpr, tpr))

AUC ROC: 0.9378295648
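
AUC ROC can be read as the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on a toy example. The function `roc_auc_pairwise` is a hypothetical helper, not the method used by `sklearn.metrics.auc`, and its quadratic cost makes it suitable only for small illustrations.

```python
def roc_auc_pairwise(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and scores, chosen only for illustration.
toy_y = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

print(roc_auc_pairwise(toy_y, toy_scores))  # 0.75
```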

11) We drew conclusions from applying a recurrent neural network to the task of text sentiment classification.

Table 1:

Model	Number of trainable parameters	Number of training epochs	Classification quality on the test set
Recurrent (LSTM)	184,897	3	accuracy: 0.860; loss: 0.3604; AUC ROC: 0.9378

The table shows that the recurrent neural network determines text sentiment well: after three epochs its test accuracy of 0.860 exceeds the required threshold of 0.8, and the AUC ROC reaches 0.9378.
