Add report, notebook, and images

2 months ago · 962a2015de
--- a/labworks/LW4/1_p10.png
+++ b/labworks/LW4/1_p10.png
--- a/labworks/LW4/1_p9.png
+++ b/labworks/LW4/1_p9.png
--- a/labworks/LW4/is_lab4.ipynb
+++ b/labworks/LW4/is_lab4.ipynb
--- a/labworks/LW4/report.md
+++ b/labworks/LW4/report.md
@@ -0,0 +1,325 @@
+# Report on Laboratory Work No. 4
+Артюшина Валерия, Хохлов Кирилл, А-01-22
+
+# Task 1
+
+## 1) Created a new notebook in Google Colab. Imported the libraries and modules required for the work. Configured the notebook to use a GPU hardware accelerator.
+
+```python
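+# import modules
+import os
+# assumes Google Drive is already mounted at /content/drive
+work_dir = '/content/drive/MyDrive/Colab Notebooks/is_lab4'
+os.makedirs(work_dir, exist_ok=True)  # does not fail if the folder already exists
+os.chdir(work_dir)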
+
+from tensorflow import keras
+from tensorflow.keras import layers
+from tensorflow.keras.models import Sequential
+import matplotlib.pyplot as plt
+import numpy as np
+```
+```python
+import tensorflow as tf
+device_name = tf.test.gpu_device_name()
+if device_name != '/device:GPU:0':
+  raise SystemError('GPU device not found')
+print('Found GPU at: {}'.format(device_name))
+```
+```
+Found GPU at: /device:GPU:0
+```
+
+## 2) Loaded the IMDb dataset, which contains digitized movie reviews labeled with two classes: positive and negative. When loading the dataset, the seed parameter was set to (4k – 1) = 15, where k = 4 is the team number. Printed the shapes of the resulting training and test arrays.
+
+```python
+# load the dataset
+from keras.datasets import imdb
+
+vocabulary_size = 5000
+index_from = 3
+
+(X_train, y_train), (X_test, y_test) = imdb.load_data(
+    path="imdb.npz",
+    num_words=vocabulary_size,
+    skip_top=0,
+    maxlen=None,
+    seed=15,
+    start_char=1,
+    oov_char=2,
+    index_from=index_from
+    )
+
+# print the array shapes
+print('Shape of X train:', X_train.shape)
+print('Shape of y train:', y_train.shape)
+print('Shape of X test:', X_test.shape)
+print('Shape of y test:', y_test.shape)
+```
+```
+Shape of X train: (25000,)
+Shape of y train: (25000,)
+Shape of X test: (25000,)
+Shape of y test: (25000,)
+```
+
+## 3) Printed one review from the training set as a list of word indices. Converted the list of indices to text and printed the review as text. Printed the review length. Printed the class label of this review and the class name (1 – Positive, 0 – Negative).
+
+```python
+# build a dictionary for converting indices to words
+# load the "word: index" dictionary
+word_to_id = imdb.get_word_index()
+# shift the indices by index_from to make room for the special tokens
+word_to_id = {key:(value + index_from) for key,value in word_to_id.items()}
+word_to_id["<PAD>"] = 0
+word_to_id["<START>"] = 1
+word_to_id["<UNK>"] = 2
+word_to_id["<UNUSED>"] = 3
+# build the reverse "index: word" dictionary
+id_to_word = {value:key for key,value in word_to_id.items()}
+```
+```python
+print(X_train[26])
+print('len:',len(X_train[26]))
+```
+```
+[1, 608, 17, 316, 47, 3381, 46, 14, 22, 9, 6, 601, 912, 8, 49, 2461, 14, 9, 88, 12, 16, 6, 2207, 2, 22, 15, 69, 6, 176, 7, 819, 2, 42, 2, 180, 8, 751, 2, 8, 1090, 4, 2, 1745, 675, 21, 4, 22, 47, 111, 85, 1508, 17, 73, 10, 10, 8, 895, 19, 4, 2, 186, 8, 28, 188, 27, 2, 5, 2109, 1849, 56, 4, 2, 11, 14, 22, 26, 2, 5, 92, 40, 3390, 21, 11, 175, 85, 1161, 36, 4486, 40, 2109, 150, 25, 43, 191, 81, 15, 19, 6, 2136, 512, 509, 874, 188, 8, 1231, 8, 4, 2269, 7, 4, 512, 42, 4, 451, 79, 32, 1471, 5, 3222, 34, 2, 2793, 11, 4, 355, 155, 11, 192, 4, 226, 1499, 5, 862, 1353, 114, 9, 142, 15, 47, 460, 77, 224, 18, 2109, 21, 152, 97, 101, 281, 11, 6, 1985, 20, 10, 10, 4129, 4, 1985, 1352, 26, 4, 2, 25, 28, 126, 110, 1814, 11, 4, 1985, 20, 970, 3882, 8, 124, 15, 4, 1985, 1352, 5, 2, 26, 142, 4, 451, 2, 2, 246, 49, 7, 134, 2, 26, 43, 1044, 2968, 10, 10, 50, 26, 6, 378, 7, 1076, 52, 1801, 13, 165, 179, 423, 4, 603, 409, 28, 1046, 2, 2, 2, 5, 10, 10, 1361, 48, 141, 6, 155, 70, 1778, 10, 10, 13, 82, 179, 423, 4, 1347, 18, 2, 4, 2, 2, 50, 26, 38, 111, 189, 102, 15, 2, 23, 105, 2, 2, 21, 11, 14, 420, 36, 86, 2, 6, 55, 2, 5, 1134, 1210, 1985, 2, 5, 140, 4682, 4, 1939, 13, 384, 25, 70, 516, 2, 19, 3390, 3589, 5, 75, 28, 49, 184, 976, 2, 134, 504, 1616, 30, 99, 254, 8, 276, 107, 5, 107, 295, 2, 21, 11, 801, 405, 14, 20, 271, 120, 4, 350, 5, 1608, 49, 85, 55, 2, 5, 1139, 1210, 2, 2872]
+len: 323
+```
+```python
+review_as_text = ' '.join(id_to_word[id] for id in X_train[26])
+print(review_as_text)
+print('len:',len(review_as_text))
+```
+```
+<START> ok as everyone has pointed out this film is a complete dog to some degree this is because it was a gory <UNK> film that had a lot of material <UNK> or <UNK> down to near <UNK> to escape the <UNK> x rating but the film has many other flaws as well br br to begin with the <UNK> seems to have got his <UNK> and vampires mixed up the <UNK> in this film are <UNK> and don't like silver but in every other respect they behave like vampires now you just can't do that with a crappy genre flick you've got to stick to the rules of the genre or the fans get all confused and annoyed by <UNK> disbelief in the wrong thing in fact the whole confusing and poorly presented plot is something that has already been done for vampires but doesn't make any sense in a werewolf movie br br secondly the werewolf costumes are the <UNK> you have ever seen anybody in the werewolf movie business ought to know that the werewolf costumes and <UNK> are something the fans <UNK> <UNK> yet some of these <UNK> are just plain goofy br br there are a couple of slightly good bits i actually quite liked the score others have mentioned <UNK> <UNK> <UNK> and br br spoiler if such a thing can exist br br i also quite liked the plan for <UNK> the <UNK> <UNK> there are so many horror movies that <UNK> on characters <UNK> <UNK> but in this case they first <UNK> a very <UNK> and effective anti werewolf <UNK> and go slaughter the monsters i mean you can kill <UNK> with silver bullets and we have some pretty powerful <UNK> these days shouldn't be too hard to put two and two together <UNK> but in typical style this movie goes over the top and adds some other very <UNK> and amusing anti <UNK> weapons
+len: 1682
+```
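
The index-to-text mapping used in this step can be illustrated on a toy vocabulary. This is only a minimal sketch: `toy_word_index` is a hypothetical stand-in for the dictionary returned by `imdb.get_word_index()`, and `decode` is an illustrative helper that is not part of Keras.

```python
# Toy stand-in for imdb.get_word_index(): rank 1 is the most frequent word.
toy_word_index = {"the": 1, "film": 2, "great": 3}

index_from = 3  # same offset as in the report

# Shift every rank by index_from to free the lowest ids for special tokens.
word_to_id = {word: rank + index_from for word, rank in toy_word_index.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
word_to_id["<UNUSED>"] = 3

# Reverse mapping "index: word".
id_to_word = {idx: word for word, idx in word_to_id.items()}


def decode(encoded_review):
    """Map a list of ids back to text; ids missing from the mapping become <UNK>."""
    return " ".join(id_to_word.get(idx, "<UNK>") for idx in encoded_review)


decoded = decode([0, 0, 1, 4, 5, 2, 6])
print(decoded)
```

With `index_from = 3`, the most frequent word gets id 4, which is why ids 0 to 3 remain available for `<PAD>`, `<START>`, `<UNK>`, and `<UNUSED>`.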
+
+
+## 4) Printed the maximum and minimum review lengths in the training set.
+
+```python
+print('MAX Len: ',len(max(X_train, key=len)))
+print('MIN Len: ',len(min(X_train, key=len)))
+```
+```
+MAX Len:  2494
+MIN Len:  11
+```
+
+## 5) Preprocessed the data. Chose a single length to which all reviews are brought. Short reviews were padded with special symbols, and long ones were truncated to the chosen length.
+
+```python
+# data preprocessing
+from tensorflow.keras.utils import pad_sequences
+max_words = 500
+X_train = pad_sequences(X_train, maxlen=max_words, value=0, padding='pre', truncating='post')
+X_test = pad_sequences(X_test, maxlen=max_words, value=0, padding='pre', truncating='post')
+```
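
What `padding='pre'` and `truncating='post'` do to a single sequence can be sketched in plain Python. The helper `pad_pre_truncate_post` below is a hypothetical illustration written for this report, not the Keras implementation.

```python
def pad_pre_truncate_post(sequence, maxlen, value=0):
    """Mimic padding='pre', truncating='post' for one sequence."""
    trimmed = list(sequence)[:maxlen]            # 'post' truncation drops the tail
    padding = [value] * (maxlen - len(trimmed))  # 'pre' padding goes in front
    return padding + trimmed


short = pad_pre_truncate_post([1, 608, 17], maxlen=5)
long = pad_pre_truncate_post([1, 608, 17, 316, 47, 3381, 46], maxlen=5)
print(short)
print(long)

# A 323-token review padded to 500 gets 500 - 323 leading pad values.
n_pad = pad_pre_truncate_post([7] * 323, maxlen=500).count(0)
print(n_pad)
```

Padding at the front keeps the actual words next to the last time steps of the LSTM, which is a common choice when only the final hidden state is used.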
+
+## 6) Repeated step 4.
+
+```python
+print('MAX Len: ',len(max(X_train, key=len)))
+print('MIN Len: ',len(min(X_train, key=len)))
+```
+```
+MAX Len:  500
+MIN Len:  500
+```
+
+## 7) Repeated step 3. Drew a conclusion about how the review changed after preprocessing.
+```python
+print(X_train[26])
+print('len:',len(X_train[26]))
+```
+```
+[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
+    0    0    0    0    0    0    0    0    0    0    0    0    0    0
+    0    0    0    0    0    0    0    0    0    0    0    0    0    0
+    0    0    0    0    0    0    0    0    0    0    0    0    0    0
+    0    0    0    0    0    0    0    0    0    0    0    0    0    0
+    0    0    0    0    0    0    0    0    0    0    0    0    0    0
+    0    0    0    0    0    0    0    0    0    0    0    0    0    0
+    0    0    0    0    0    0    0    0    0    0    0    0    0    0
+    0    0    0    0    0    0    0    0    0    0    0    0    0    0
+    0    0    0    0    0    0    0    0    0    0    0    0    0    0
+    0    0    0    0    0    0    0    0    0    0    0    0    0    0
+    0    0    0    0    0    0    0    0    0    0    0    0    0    0
+    0    0    0    0    0    0    0    0    0    1  608   17  316   47
+ 3381   46   14   22    9    6  601  912    8   49 2461   14    9   88
+   12   16    6 2207    2   22   15   69    6  176    7  819    2   42
+    2  180    8  751    2    8 1090    4    2 1745  675   21    4   22
+   47  111   85 1508   17   73   10   10    8  895   19    4    2  186
+    8   28  188   27    2    5 2109 1849   56    4    2   11   14   22
+   26    2    5   92   40 3390   21   11  175   85 1161   36 4486   40
+ 2109  150   25   43  191   81   15   19    6 2136  512  509  874  188
+    8 1231    8    4 2269    7    4  512   42    4  451   79   32 1471
+    5 3222   34    2 2793   11    4  355  155   11  192    4  226 1499
+    5  862 1353  114    9  142   15   47  460   77  224   18 2109   21
+  152   97  101  281   11    6 1985   20   10   10 4129    4 1985 1352
+   26    4    2   25   28  126  110 1814   11    4 1985   20  970 3882
+    8  124   15    4 1985 1352    5    2   26  142    4  451    2    2
+  246   49    7  134    2   26   43 1044 2968   10   10   50   26    6
+  378    7 1076   52 1801   13  165  179  423    4  603  409   28 1046
+    2    2    2    5   10   10 1361   48  141    6  155   70 1778   10
+   10   13   82  179  423    4 1347   18    2    4    2    2   50   26
+   38  111  189  102   15    2   23  105    2    2   21   11   14  420
+   36   86    2    6   55    2    5 1134 1210 1985    2    5  140 4682
+    4 1939   13  384   25   70  516    2   19 3390 3589    5   75   28
+   49  184  976    2  134  504 1616   30   99  254    8  276  107    5
+  107  295    2   21   11  801  405   14   20  271  120    4  350    5
+ 1608   49   85   55    2    5 1139 1210    2 2872]
+len: 500
+```
+
+```python
+review_as_text = ' '.join(id_to_word[id] for id in X_train[26])
+print(review_as_text)
+print('len:',len(review_as_text))
+```
+```
+<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <START> ok as everyone has pointed out this film is a complete dog to some degree this is because it was a gory <UNK> film that had a lot of material <UNK> or <UNK> down to near <UNK> to escape the <UNK> x rating but the film has many other flaws as well br br to begin with the <UNK> seems to have got his <UNK> and vampires mixed up the <UNK> in this film are <UNK> and don't like silver but in every other respect they behave like vampires now you just can't do that with a crappy genre flick you've got to stick to the rules of the genre or the fans get all confused and annoyed by <UNK> disbelief in the wrong thing in fact the whole confusing and poorly presented plot is something that has already been done for vampires but doesn't make any sense in a werewolf movie br br secondly the werewolf costumes are the <UNK> you have ever seen anybody in the werewolf movie business ought to know that the werewolf costumes and <UNK> 
are something the fans <UNK> <UNK> yet some of these <UNK> are just plain goofy br br there are a couple of slightly good bits i actually quite liked the score others have mentioned <UNK> <UNK> <UNK> and br br spoiler if such a thing can exist br br i also quite liked the plan for <UNK> the <UNK> <UNK> there are so many horror movies that <UNK> on characters <UNK> <UNK> but in this case they first <UNK> a very <UNK> and effective anti werewolf <UNK> and go slaughter the monsters i mean you can kill <UNK> with silver bullets and we have some pretty powerful <UNK> these days shouldn't be too hard to put two and two together <UNK> but in typical style this movie goes over the top and adds some other very <UNK> and amusing anti <UNK> weapons
+len: 2744
+```
+After preprocessing, the required number of `<PAD>` tokens (177 here, since the original review contained 323 indices) was prepended to the review so that it is exactly 500 indices long.
+
+
+## 8) Printed the preprocessed training and test arrays and their shapes.
+
+```python
+# print the data
+print('X train: \n',X_train)
+print('X test: \n',X_test)
+
+# print the array shapes
+print('Shape of X train:', X_train.shape)
+print('Shape of X test:', X_test.shape)
+```
+```
+X train: 
+ [[   0    0    0 ...    4   86  273]
+ [   0    0    0 ...  705    9  150]
+ [   0    0    0 ...   44   12   32]
+ ...
+ [   0    0    0 ...   22    8  377]
+ [   0    0    0 ...    4 2554  647]
+ [   0    0    0 ...    2    4    2]]
+X test: 
+ [[  0   0   0 ... 106  14  31]
+ [  0   0   0 ... 458 168  52]
+ [  0   0   0 ...  22   6  31]
+ ...
+ [  0   0   0 ...  38  76 128]
+ [  0   0   0 ...  73 290  12]
+ [  0   0   0 ...  12  38  76]]
+Shape of X train: (25000, 500)
+Shape of X test: (25000, 500)
+```
+
+## 9) Implemented a recurrent neural network model consisting of Embedding, LSTM, Dropout, and Dense layers, and trained it on the training data with part of it held out for validation. Printed information about the network architecture. Achieved an accuracy of at least 0.8.
+
+```python
+embed_dim = 32
+lstm_units = 64
+
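+model = Sequential()
+# declare the input shape explicitly instead of the deprecated input_length / input_shape layer arguments
+model.add(keras.Input(shape=(max_words,)))
+model.add(layers.Embedding(input_dim=vocabulary_size, output_dim=embed_dim))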
+model.add(layers.LSTM(lstm_units))
+model.add(layers.Dropout(0.5))
+model.add(layers.Dense(1, activation='sigmoid'))
+
+model.summary()
+```
+![Output of model.summary()](1_p9.png)
+
+```python
+# compile and train the model
+batch_size = 64
+epochs = 3
+model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
+model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.2)
+```
+```
+Epoch 1/3
+313/313 ━━━━━━━━━━━━━━━━━━━━ 12s 23ms/step - accuracy: 0.6328 - loss: 0.6075 - val_accuracy: 0.7826 - val_loss: 0.4588
+Epoch 2/3
+313/313 ━━━━━━━━━━━━━━━━━━━━ 6s 21ms/step - accuracy: 0.8121 - loss: 0.4143 - val_accuracy: 0.8628 - val_loss: 0.3359
+Epoch 3/3
+313/313 ━━━━━━━━━━━━━━━━━━━━ 10s 21ms/step - accuracy: 0.8905 - loss: 0.2795 - val_accuracy: 0.8506 - val_loss: 0.3324
+<keras.src.callbacks.history.History at 0x7a8b2b9509e0>
+```
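
The `313/313` step counter in the training log follows from the dataset size, the validation split, and the batch size. A small arithmetic check, using only the values already given in the report:

```python
import math

n_train = 25000          # size of the training set
validation_split = 0.2   # share held out for validation
batch_size = 64

n_val = round(n_train * validation_split)        # samples held out for validation
n_fit = n_train - n_val                          # samples actually used for fitting
steps_per_epoch = math.ceil(n_fit / batch_size)  # the last batch may be partial
print(n_val, n_fit, steps_per_epoch)
```

So each epoch fits on 20,000 reviews and validates on the remaining 5,000.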
+```python
+test_loss, test_acc = model.evaluate(X_test, y_test)
+print(f"\nTest accuracy: {test_acc}")
+print(f"\nTest loss: {test_loss}")
+```
+```
+accuracy: 0.8544 
+loss: 0.3396
+
+Test accuracy: 0.8564800024032593
+
+Test loss: 0.33131280541419983
+```
+
+## 10) Evaluated the quality of training on the test data.
+* printed the value of the classification quality metric on the test data
+* printed a classification report for the test set
+* plotted the ROC curve from the predictions on the test set and computed the area under the ROC curve (AUC ROC)
+
+```python
+# classification quality metric on the test data
+print(f"\nTest accuracy: {test_acc}")
+```
+```
+Test accuracy: 0.8564800024032593
+```
+
+```python
+# classification report for the test set
+y_score = model.predict(X_test)
+y_pred = [1 if y_score[i,0]>=0.5 else 0 for i in range(len(y_score))]
+
+from sklearn.metrics import classification_report
+print(classification_report(y_test, y_pred, labels = [0, 1], target_names=['Negative', 'Positive']))
+```
+```
+              precision    recall  f1-score   support
+
+    Negative       0.83      0.90      0.86     12500
+    Positive       0.89      0.82      0.85     12500
+
+    accuracy                           0.86     25000
+   macro avg       0.86      0.86      0.86     25000
+weighted avg       0.86      0.86      0.86     25000
+```
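
How the precision, recall, and F1 columns are derived from thresholded predictions can be sketched without scikit-learn. The helper `binary_metrics` and the six toy scores are hypothetical and serve only as an illustration; the last lines check that the rounded F1 values in the report agree with its precision and recall.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the class labeled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical sigmoid outputs for six reviews (not taken from the report).
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1 if s >= 0.5 else 0 for s in toy_scores]  # same 0.5 threshold as above

precision, recall, f1 = binary_metrics(toy_true, toy_pred)
print(precision, recall, f1)

# Consistency check on the rounded precision/recall values from the report.
f1_negative = 2 * 0.83 * 0.90 / (0.83 + 0.90)
f1_positive = 2 * 0.89 * 0.82 / (0.89 + 0.82)
print(round(f1_negative, 2), round(f1_positive, 2))
```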
+
+```python
+# plot the ROC curve and compute AUC ROC
+from sklearn.metrics import roc_curve, auc
+
+fpr, tpr, thresholds = roc_curve(y_test, y_score)
+plt.plot(fpr, tpr)
+plt.grid()
+plt.xlabel('False Positive Rate')
+plt.ylabel('True Positive Rate')
+plt.title('ROC')
+plt.show()
+print('AUC ROC:', auc(fpr, tpr))
+```
+![ROC curve for the test set](1_p10.png)
+```
+AUC ROC: 0.9356664896
+```
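
The AUC value is the area under the piecewise-linear ROC curve. The sketch below rebuilds the curve and integrates it with the trapezoid rule on a hypothetical six-sample example. `roc_points` and `trapezoid_auc` are illustrative helpers, and the simplification that no two scores are tied is an assumption of this sketch.

```python
def roc_points(y_true, y_score):
    """Sweep the threshold over all scores and collect (FPR, TPR) points."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Highest score first; this sketch assumes there are no tied scores.
    ranked = sorted(zip(y_score, y_true), reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the curve by the trapezoid rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )


# Hypothetical labels and scores (not taken from the report).
toy_true = [1, 1, 0, 1, 0, 0]
toy_score = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

fpr, tpr = roc_points(toy_true, toy_score)
toy_auc = trapezoid_auc(fpr, tpr)
print(toy_auc)
```

In this toy case 8 of the 9 positive-negative pairs are ranked correctly, so the area equals 8/9, which matches the interpretation of AUC as the probability that a random positive review scores higher than a random negative one.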
+
+## 11) Drew conclusions from applying the recurrent neural network to the sentiment classification task.
+
+Table 1:
+
+| Model | Number of trainable parameters | Number of training epochs | Classification quality on the test set |
+|-------|--------------------------------|---------------------------|----------------------------------------|
+| Recurrent | 184,897 | 3 | accuracy: 0.8564; loss: 0.3313; AUC ROC: 0.9357 |
+
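The parameter count in the table can be reproduced by hand from the layer sizes used in step 9. This sketch assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias vector.

```python
vocabulary_size = 5000
embed_dim = 32
lstm_units = 64

# Embedding: one embed_dim-sized vector per vocabulary entry.
embedding_params = vocabulary_size * embed_dim

# LSTM: four gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dropout has no trainable parameters.
dropout_params = 0

# Dense: one output neuron with lstm_units weights and one bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dropout_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Most of the parameters (160,000 of 184,897) sit in the embedding layer, so the vocabulary size and embedding dimension dominate the model size.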
+
+## Conclusion
+Based on the results, the recurrent neural network handled the sentiment classification task reasonably well. The test accuracy of 0.8564 exceeds the required threshold of 0.8. The AUC ROC of 0.9357 (> 0.9) indicates that the model separates positive and negative reviews well.