Report on Laboratory Work No. 4
Li Tae Ho, Sinyavsky Stepan, group A-02-22
Team 3
Task 1
1) Created a new notebook in Google Colab, imported the libraries and modules needed for the work, and configured the notebook to use a GPU hardware accelerator.
# import modules
import os
os.chdir('/content/drive/MyDrive/Colab Notebooks/is_lab4')
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
Found GPU at: /device:GPU:0
2) Loaded the IMDb dataset of movie reviews encoded as word indices and labeled with two classes: positive and negative. The seed parameter was set to 4k - 1 = 11, where k = 3 is the team number. Printed the shapes of the resulting training and test arrays.
# load the dataset
from keras.datasets import imdb
vocabulary_size = 5000
index_from = 3
(X_train, y_train), (X_test, y_test) = imdb.load_data(
    path="imdb.npz",
    num_words=vocabulary_size,
    skip_top=0,
    maxlen=None,
    seed=11,
    start_char=1,
    oov_char=2,
    index_from=index_from
)
# print the shapes
print('Shape of X train:', X_train.shape)
print('Shape of y train:', y_train.shape)
print('Shape of X test:', X_test.shape)
print('Shape of y test:', y_test.shape)
Shape of X train: (25000,)
Shape of y train: (25000,)
Shape of X test: (25000,)
Shape of y test: (25000,)
3) Printed one review from the training set as a list of word indices. Converted the list of indices to text and printed the review as text. Printed the review length. Printed the class label of this review and the class name (1: Positive, 0: Negative).
# build a dictionary for translating indices into words
# load the "word: index" dictionary
word_to_id = imdb.get_word_index()
# shift the indices by index_from and add the reserved tokens
word_to_id = {key:(value + index_from) for key,value in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
word_to_id["<UNUSED>"] = 3
# build the reverse "index: word" dictionary
id_to_word = {value:key for key,value in word_to_id.items()}
print(X_train[26])
print('len:',len(X_train[26]))
[1, 2489, 723, 2, 9, 399, 2301, 11, 551, 2, 29, 47, 1391, 6, 1692, 15, 29, 70, 361, 8, 97, 35, 3258, 40, 6, 2, 106, 42, 2, 4298, 64, 8, 28, 15, 3258, 796, 2, 11, 6, 275, 1622, 21, 50, 26, 148, 33, 27, 2301, 2, 15, 81, 24, 40, 42, 2, 7, 27, 4646, 5, 80, 81, 845, 12, 304, 8, 67, 15, 29, 152, 3115, 103, 6, 1196, 2, 15, 238, 28, 1894, 27, 2, 2489, 2, 1068, 8, 2181, 27, 1692, 23, 309, 17, 873, 183, 140, 2357, 355, 5, 29, 9, 2, 83, 6, 2699, 1765, 2, 625, 2691, 1229, 80, 516, 10, 10, 11, 2, 279, 12, 286, 141, 6, 52, 326, 8, 796, 106, 4, 2, 132, 11, 4, 172, 1269, 13, 296, 4, 2223, 994, 7, 4, 2223, 5, 3176, 7, 4, 2223, 50, 186, 8, 30, 64, 38, 111, 102, 44, 551, 2, 5, 4, 4616, 3388, 302, 12, 70, 28, 23, 4, 406, 648, 15, 31, 415, 144, 30, 93, 8, 4325, 11, 6, 289, 42, 689, 251, 810, 146, 24, 252, 51, 148, 1893, 18, 4, 20, 1029, 17, 68, 2436, 819, 18, 4, 2, 132, 21, 76, 7, 12, 9, 38, 729, 8, 4, 2223, 102, 15, 12, 566, 30, 2691, 2, 190, 4, 2, 132, 218, 60, 754, 17, 52, 17, 4, 249, 7, 4, 2223, 2355, 10, 10, 1371, 112, 1905, 4981, 4, 2, 132, 47, 450, 85, 712, 15, 66, 1487, 4, 3129, 7, 4, 20, 6, 194, 1834, 13, 28, 9, 19, 2, 2, 11, 4, 485, 240, 141, 6, 2, 1995, 15, 24, 64, 81, 13, 24, 459, 44, 27, 2073, 13, 165, 3663, 18, 12, 696, 177, 1066, 1083, 2, 5, 2, 1602, 26, 220, 17, 78, 507, 38, 1904, 5, 753, 36, 983, 551, 11, 192, 225, 55, 117, 8, 79, 2229, 44, 137, 149, 4, 2, 132, 4, 816, 475, 24, 55, 906, 4, 168, 475, 13, 62, 1634, 76, 7, 12, 17, 2, 4, 114, 475, 727, 4, 206, 475, 50, 218, 101, 444, 14, 9, 31, 8, 798, 10, 10, 2994, 13, 296, 4, 2, 132, 2864, 6, 1039, 7, 4, 736, 1067, 750, 2, 390, 163, 538, 137, 24, 35, 1557, 55, 400, 4, 2, 4, 20, 475, 4, 128, 4, 3179, 2, 4, 493, 569, 220, 32, 7, 68, 3734, 19, 4, 2, 132, 637, 202, 12, 6, 55, 2, 470, 457, 23, 61, 3179, 675, 2407]
len: 413
review_as_text = ' '.join(id_to_word[id] for id in X_train[26])
print(review_as_text)
print('len:',len(review_as_text))
<START> professor paul <UNK> is doing research in matter <UNK> he has developed a machine that he can use to make an object like a <UNK> watch or <UNK> disappear only to have that object re <UNK> in a different location but there are those at his research <UNK> that do not like or <UNK> of his experiments and will do whatever it takes to see that he doesn't succeed after a failed <UNK> that might have saved his <UNK> professor <UNK> decides to test his machine on himself as expected things go horribly wrong and he is <UNK> into a heavily scared <UNK> whose mere touch will kill br br in <UNK> maybe it wasn't such a good idea to re watch the <UNK> man in the same week i watched the fly return of the fly and curse of the fly there seems to be only so many movies about matter <UNK> and the potentially horrendous effects it can have on the human body that one person should be made to endure in a three or four day period i'm not sure what those responsible for the movie list as their source material for the <UNK> man but much of it is so similar to the fly movies that it cannot be mere <UNK> however the <UNK> man isn't even nearly as good as the worst of the fly trilogy br br besides being terribly unoriginal the <UNK> man has several other problems that really hurt the enjoyment of the movie a big issue i have is with <UNK> <UNK> in the lead he's such a <UNK> ass that not only do i not care about his suffering i actually root for it supporting cast members mary <UNK> and <UNK> allen are almost as bad they're so bland and dull they hardly matter in fact there's very little to get excited about while watching the <UNK> man the soundtrack not very memorable the look i would describe much of it as <UNK> the plot predictable the action there isn't any overall this is one to avoid br br fortunately i watched the <UNK> man via a copy of the mystery science theater <UNK> episode funny stuff while not an absolute very often the <UNK> the movie the better the mst3k <UNK> the guys 
hit almost all of their marks with the <UNK> man i'll give it a very <UNK> 4 5 on my mst3k rating scale
len: 2113
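The decoding step above can be sketched as two small helpers. This is a minimal illustration under stated assumptions: `toy_word_index` is a hypothetical three-word stand-in for the dictionary returned by `imdb.get_word_index()`, and the helper names `build_id_to_word` and `decode_review` are not part of the original notebook. Note that the first `len:` above (413) counts tokens, while the second (2113) counts characters of the joined string.

```python
# Hypothetical stand-in for imdb.get_word_index(); the real dictionary is far larger.
toy_word_index = {"the": 1, "and": 2, "movie": 3}

INDEX_FROM = 3  # same offset as index_from in the load_data call


def build_id_to_word(word_index, index_from=INDEX_FROM):
    """Shift raw word ranks by index_from and add the reserved tokens."""
    word_to_id = {word: rank + index_from for word, rank in word_index.items()}
    word_to_id.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3})
    return {idx: word for word, idx in word_to_id.items()}


def decode_review(encoded, id_to_word):
    """Turn a sequence of indices into text; unmapped indices fall back to <UNK>."""
    return " ".join(id_to_word.get(int(i), "<UNK>") for i in encoded)


toy_id_to_word = build_id_to_word(toy_word_index)
text = decode_review([0, 1, 4, 6, 2, 5], toy_id_to_word)
print(text)               # <PAD> <START> the movie <UNK> and
print(len(text.split()))  # 6 tokens
print(len(text))          # number of characters, not tokens
```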
4) Printed the maximum and minimum review length in the training set.
print('MAX Len: ',len(max(X_train, key=len)))
print('MIN Len: ',len(min(X_train, key=len)))
MAX Len: 2494
MIN Len: 11
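The spread between 11 and 2494 tokens is what makes a single fixed length necessary. One possible way to judge a candidate length is to check what share of reviews fits into it without truncation. The sketch below uses a hypothetical list `lengths` in place of `[len(x) for x in X_train]`, and the helper `coverage` is an illustrative name, not something from the notebook.

```python
import statistics

# Hypothetical token counts standing in for [len(x) for x in X_train].
lengths = [11, 120, 180, 240, 413, 520, 2494]


def coverage(lengths, max_len):
    """Fraction of reviews that fit into max_len tokens without truncation."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)


print("min:", min(lengths), "max:", max(lengths))
print("median:", statistics.median(lengths))
print("coverage at 500:", round(coverage(lengths, 500), 3))
```

On the real data the same function could be called with several candidate values to see how much text each one would cut off.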
5) Preprocessed the data. Chose a single length for all reviews: short reviews are padded with a special token (index 0), and long reviews are truncated to the chosen length.
# preprocess the data
from tensorflow.keras.utils import pad_sequences
max_words = 500
X_train = pad_sequences(X_train, maxlen=max_words, value=0, padding='pre', truncating='post')
X_test = pad_sequences(X_test, maxlen=max_words, value=0, padding='pre', truncating='post')
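To make the effect of `padding='pre'` and `truncating='post'` concrete, here is a pure-Python sketch of what happens to a single sequence. The function `pad_or_truncate` is a hypothetical illustration, not the Keras implementation.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Mimic padding='pre', truncating='post' for one sequence."""
    seq = list(seq)[:maxlen]                     # 'post' truncation drops the tail
    return [value] * (maxlen - len(seq)) + seq   # 'pre' padding fills the front


short = pad_or_truncate([1, 14, 22], maxlen=5)
long_ = pad_or_truncate([1, 14, 22, 16, 43, 530], maxlen=5)
print(short)  # [0, 0, 1, 14, 22]
print(long_)  # [1, 14, 22, 16, 43]
```

With these settings a long review keeps its beginning (including the start token 1), and a short review ends with its own last word, so the LSTM reads the real tokens last.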
6) Repeated step 4.
print('MAX Len: ',len(max(X_train, key=len)))
print('MIN Len: ',len(min(X_train, key=len)))
MAX Len: 500
MIN Len: 500
7) Repeated step 3. Drew a conclusion about how preprocessing changed the review.
print(X_train[26])
print('len:',len(X_train[26]))
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 2489 723 2 9 399 2301 11 551 2 29
47 1391 6 1692 15 29 70 361 8 97 35 3258 40 6
2 106 42 2 4298 64 8 28 15 3258 796 2 11 6
275 1622 21 50 26 148 33 27 2301 2 15 81 24 40
42 2 7 27 4646 5 80 81 845 12 304 8 67 15
29 152 3115 103 6 1196 2 15 238 28 1894 27 2 2489
2 1068 8 2181 27 1692 23 309 17 873 183 140 2357 355
5 29 9 2 83 6 2699 1765 2 625 2691 1229 80 516
10 10 11 2 279 12 286 141 6 52 326 8 796 106
4 2 132 11 4 172 1269 13 296 4 2223 994 7 4
2223 5 3176 7 4 2223 50 186 8 30 64 38 111 102
44 551 2 5 4 4616 3388 302 12 70 28 23 4 406
648 15 31 415 144 30 93 8 4325 11 6 289 42 689
251 810 146 24 252 51 148 1893 18 4 20 1029 17 68
2436 819 18 4 2 132 21 76 7 12 9 38 729 8
4 2223 102 15 12 566 30 2691 2 190 4 2 132 218
60 754 17 52 17 4 249 7 4 2223 2355 10 10 1371
112 1905 4981 4 2 132 47 450 85 712 15 66 1487 4
3129 7 4 20 6 194 1834 13 28 9 19 2 2 11
4 485 240 141 6 2 1995 15 24 64 81 13 24 459
44 27 2073 13 165 3663 18 12 696 177 1066 1083 2 5
2 1602 26 220 17 78 507 38 1904 5 753 36 983 551
11 192 225 55 117 8 79 2229 44 137 149 4 2 132
4 816 475 24 55 906 4 168 475 13 62 1634 76 7
12 17 2 4 114 475 727 4 206 475 50 218 101 444
14 9 31 8 798 10 10 2994 13 296 4 2 132 2864
6 1039 7 4 736 1067 750 2 390 163 538 137 24 35
1557 55 400 4 2 4 20 475 4 128 4 3179 2 4
493 569 220 32 7 68 3734 19 4 2 132 637 202 12
6 55 2 470 457 23 61 3179 675 2407]
len: 500
review_as_text = ' '.join(id_to_word[id] for id in X_train[26])
print(review_as_text)
print('len:',len(review_as_text))
professor paul <UNK> is doing research in matter <UNK> he has developed a machine that he can use to make an object like a <UNK> watch or <UNK> disappear only to have that object re <UNK> in a different location but there are those at his research <UNK> that do not like or <UNK> of his experiments and will do whatever it takes to see that he doesn't succeed after a failed <UNK> that might have saved his <UNK> professor <UNK> decides to test his machine on himself as expected things go horribly wrong and he is <UNK> into a heavily scared <UNK> whose mere touch will kill br br in <UNK> maybe it wasn't such a good idea to re watch the <UNK> man in the same week i watched the fly return of the fly and curse of the fly there seems to be only so many movies about matter <UNK> and the potentially horrendous effects it can have on the human body that one person should be made to endure in a three or four day period i'm not sure what those responsible for the movie list as their source material for the <UNK> man but much of it is so similar to the fly movies that it cannot be mere <UNK> however the <UNK> man isn't even nearly as good as the worst of the fly trilogy br br besides being terribly unoriginal the <UNK> man has several other problems that really hurt the enjoyment of the movie a big issue i have is with <UNK> <UNK> in the lead he's such a <UNK> ass that not only do i not care about his suffering i actually root for it supporting cast members mary <UNK> and <UNK> allen are almost as bad they're so bland and dull they hardly matter in fact there's very little to get excited about while watching the <UNK> man the soundtrack not very memorable the look i would describe much of it as <UNK> the plot predictable the action there isn't any overall this is one to avoid br br fortunately i watched the <UNK> man via a copy of the mystery science theater <UNK> episode funny stuff while not an absolute very often the <UNK> the movie the better the mst3k <UNK> the guys hit 
almost all of their marks with the <UNK> man i'll give it a very <UNK> 4 5 on my mst3k rating scale
len: 2635
After preprocessing, 87 <PAD> tokens (index 0) were prepended to the review so that it is exactly 500 indices long (413 + 87 = 500). This is also why the decoded text grew from 2113 to 2635 characters.
8) Printed the preprocessed training and test arrays and their shapes.
# print the data
print('X train: \n',X_train)
print('X test: \n',X_test)
# print the shapes
print('Shape of X train:', X_train.shape)
print('Shape of X test:', X_test.shape)
X train:
[[ 0 0 0 ... 6 2 2]
[ 0 0 0 ... 10 10 2]
[ 1 14 22 ... 171 153 303]
...
[ 0 0 0 ... 17 2199 1262]
[ 0 0 0 ... 606 5 1356]
[ 0 0 0 ... 1026 5 804]]
X test:
[[ 0 0 0 ... 10 10 2]
[ 0 0 0 ... 43 1044 710]
[ 0 0 0 ... 35 744 23]
...
[ 0 0 0 ... 184 1543 616]
[ 0 0 0 ... 38 2 78]
[ 0 0 0 ... 5 2 2]]
Shape of X train: (25000, 500)
Shape of X test: (25000, 500)
9) Implemented a recurrent neural network consisting of Embedding, LSTM, Dropout, and Dense layers and trained it on the training data, holding out part of the training data for validation. Printed the network architecture. Reached an accuracy of at least 0.8.
embed_dim = 32
lstm_units = 64
model = Sequential()
model.add(layers.Embedding(input_dim=vocabulary_size, output_dim=embed_dim, input_length=max_words, input_shape=(max_words,)))
model.add(layers.LSTM(lstm_units))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))
model.summary()
Model: "sequential"
| Layer (type) | Output Shape | Param # |
|---|---|---|
| embedding_4 (Embedding) | (None, 500, 32) | 160,000 |
| lstm_4 (LSTM) | (None, 64) | 24,832 |
| dropout_4 (Dropout) | (None, 64) | 0 |
| dense_4 (Dense) | (None, 1) | 65 |
Total params: 184,897 (722.25 KB)
Trainable params: 184,897 (722.25 KB)
Non-trainable params: 0 (0.00 B)
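The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias vector, and 4-byte float32 weights for the size estimate.

```python
vocabulary_size, embed_dim, lstm_units = 5000, 32, 64

# Embedding: one embed_dim-sized vector per vocabulary entry
embedding_params = vocabulary_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias vector
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dense(1): one weight per LSTM unit plus a bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
size_kb = total_params * 4 / 1024  # assuming float32 weights

print(embedding_params, lstm_params, dense_params, total_params)
# 160000 24832 65 184897
print(round(size_kb, 2))
# 722.25
```

The Dropout layer has no weights, which matches the 0 in the table.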
# compile and train the model
batch_size = 64
epochs = 3
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.2)
Epoch 1/3
313/313 ━━━━━━━━━━━━━━━━━━━━ 13s 23ms/step - accuracy: 0.6613 - loss: 0.5831 - val_accuracy: 0.8470 - val_loss: 0.3631
Epoch 2/3
313/313 ━━━━━━━━━━━━━━━━━━━━ 18s 25ms/step - accuracy: 0.8749 - loss: 0.3133 - val_accuracy: 0.7728 - val_loss: 0.5550
Epoch 3/3
313/313 ━━━━━━━━━━━━━━━━━━━━ 7s 21ms/step - accuracy: 0.8655 - loss: 0.3285 - val_accuracy: 0.8696 - val_loss: 0.3508
<keras.src.callbacks.history.History at 0x7a8f3a94e2a0>
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"\nTest accuracy: {test_acc}")
782/782 ━━━━━━━━━━━━━━━━━━━━ 9s 11ms/step - accuracy: 0.8611 - loss: 0.3604
Test accuracy: 0.8602399826049805
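The step counts in the progress bars (313 for training, 782 for evaluation) follow from the data split and the batch sizes. A minimal sketch of the arithmetic, assuming `model.evaluate` falls back to its default batch size of 32 because none is passed:

```python
import math

n_train, validation_split, batch_size = 25000, 0.2, 64

n_val = round(n_train * validation_split)   # samples held out for validation
n_fit = n_train - n_val                     # samples used for weight updates
steps_per_epoch = math.ceil(n_fit / batch_size)
print(n_fit, n_val, steps_per_epoch)        # 20000 5000 313

# Assumption: evaluate() uses a batch size of 32 when none is given
n_test, eval_batch_size = 25000, 32
eval_steps = math.ceil(n_test / eval_batch_size)
print(eval_steps)                           # 782
```

So each epoch updates the weights on 20,000 reviews, and the `val_accuracy` values are computed on the remaining 5,000.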
10) Evaluated the trained model on the test data:
- printed the classification quality metric on the test data
- printed the classification report for the test set
- plotted the ROC curve from the test set predictions and computed the area under the ROC curve (AUC ROC)
# classification quality metric on the test data
print(f"\nTest accuracy: {test_acc}")
Test accuracy: 0.8602399826049805
# classification report for the test set
y_score = model.predict(X_test)
y_pred = [1 if y_score[i,0]>=0.5 else 0 for i in range(len(y_score))]
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, labels = [0, 1], target_names=['Negative', 'Positive']))
precision recall f1-score support
Negative 0.82 0.92 0.87 12500
Positive 0.91 0.80 0.85 12500
accuracy 0.86 25000
macro avg 0.86 0.86 0.86 25000
weighted avg 0.86 0.86 0.86 25000
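To see how the report's numbers relate to each other, the sketch below recomputes precision, recall, and F1 from a confusion matrix. The four counts are hypothetical: they are rounded values inferred from the reported recall (0.92 for Negative, 0.80 for Positive) and the class size of 12500, not the model's actual confusion matrix.

```python
# Hypothetical counts inferred from the rounded recall values in the report.
tn, fp = 11500, 1000   # truly Negative reviews: predicted Negative / Positive
fn, tp = 2500, 10000   # truly Positive reviews: predicted Negative / Positive


def prf(correct, predicted_total, actual_total):
    """Return (precision, recall, f1) rounded to two decimals."""
    precision = correct / predicted_total
    recall = correct / actual_total
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)


negative = prf(tn, tn + fn, tn + fp)
positive = prf(tp, tp + fp, tp + fn)
accuracy = round((tn + tp) / (tn + fp + fn + tp), 2)
print(negative, positive, accuracy)
# (0.82, 0.92, 0.87) (0.91, 0.8, 0.85) 0.86
```

Under these assumed counts the model misses about 2500 positive reviews but mislabels only about 1000 negative ones, which is consistent with the higher precision and lower recall reported for the Positive class.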
# plot the ROC curve and compute AUC ROC
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_score)
plt.plot(fpr, tpr)
plt.grid()
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.show()
print('AUC ROC:', auc(fpr, tpr))
AUC ROC: 0.9378295648
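AUC ROC can also be read as the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one. The pure-Python sketch below computes it that way on toy data. The function `roc_auc` and the scores are illustrative assumptions, not the model's output, and the quadratic loop is meant for small examples only.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]   # toy scores, not model output
print(roc_auc(y_true, y_score))   # 0.75
```

Unlike accuracy, this value does not depend on the 0.5 threshold used to build `y_pred`.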
11) Drew conclusions from applying the recurrent neural network to the task of sentiment classification of text.
Table 1:
| Model | Number of trainable parameters | Number of training epochs | Classification quality on the test set |
|---|---|---|---|
| Recurrent | 184,897 | 3 | accuracy: 0.860; loss: 0.3604; AUC ROC: 0.9378 |
