# Report on Laboratory Work No. 4

Артюшина Валерия, Хохлов Кирилл, А-01-22

# Task 1

## 1) We created a new notebook in Google Colab, imported the libraries and modules needed for the work, and configured the notebook to use the GPU hardware accelerator.

```python
# import modules and switch to the working directory
# (the Google Drive path assumes Drive is already mounted in Colab)
import os

os.mkdir('/content/drive/MyDrive/Colab Notebooks/is_lab4')
os.chdir('/content/drive/MyDrive/Colab Notebooks/is_lab4')

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
import matplotlib.pyplot as plt
import numpy as np
```

```python
# check that a GPU device is available
import tensorflow as tf

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
```

```
Found GPU at: /device:GPU:0
```

## 2) We loaded the IMDb dataset, which contains digitized movie reviews labeled with two classes: positive and negative. When loading the dataset, the seed parameter was set to (4k - 1) = 15, where k = 4 is the team number. We printed the shapes of the resulting training and test arrays.

```python
# load the dataset
from keras.datasets import imdb

vocabulary_size = 5000
index_from = 3

(X_train, y_train), (X_test, y_test) = imdb.load_data(
    path="imdb.npz",
    num_words=vocabulary_size,
    skip_top=0,
    maxlen=None,
    seed=15,
    start_char=1,
    oov_char=2,
    index_from=index_from
)

# print the shapes
print('Shape of X train:', X_train.shape)
print('Shape of y train:', y_train.shape)
print('Shape of X test:', X_test.shape)
print('Shape of y test:', y_test.shape)
```

```
Shape of X train: (25000,)
Shape of y train: (25000,)
Shape of X test: (25000,)
Shape of y test: (25000,)
```
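A quick check of this kind (a minimal sketch, not part of the assignment's recorded output) can confirm that `num_words=5000` caps the word indices at the vocabulary size and that the labels take only the values 0 and 1:

```python
# sanity check on the loaded data (illustrative sketch)
max_index = max(max(review) for review in X_train)
print('Largest word index in X_train:', max_index)   # expected to be < vocabulary_size
print('Label values:', np.unique(y_train))           # expected: [0 1]
```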
## 3) We printed one review from the training set as a list of word indices, converted the list of indices into text and printed the review as text, and printed the length of the review. We also printed the class label of this review and the class name (1 – Positive, 0 – Negative).

```python
# build a dictionary for translating indices into words
# load the "word: index" dictionary
word_to_id = imdb.get_word_index()

# adjust the dictionary: shift the indices by index_from and add the service tokens
word_to_id = {key: (value + index_from) for key, value in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
word_to_id["<UNUSED>"] = 3

# build the inverse "index: word" dictionary
id_to_word = {value: key for key, value in word_to_id.items()}
```

```python
print(X_train[26])
print('len:', len(X_train[26]))
```

```
[1, 608, 17, 316, 47, 3381, 46, 14, 22, 9, 6, 601, 912, 8, 49, 2461, 14, 9, 88, 12, 16, 6, 2207, 2, 22, 15, 69, 6, 176, 7, 819, 2, 42, 2, 180, 8, 751, 2, 8, 1090, 4, 2, 1745, 675, 21, 4, 22, 47, 111, 85, 1508, 17, 73, 10, 10, 8, 895, 19, 4, 2, 186, 8, 28, 188, 27, 2, 5, 2109, 1849, 56, 4, 2, 11, 14, 22, 26, 2, 5, 92, 40, 3390, 21, 11, 175, 85, 1161, 36, 4486, 40, 2109, 150, 25, 43, 191, 81, 15, 19, 6, 2136, 512, 509, 874, 188, 8, 1231, 8, 4, 2269, 7, 4, 512, 42, 4, 451, 79, 32, 1471, 5, 3222, 34, 2, 2793, 11, 4, 355, 155, 11, 192, 4, 226, 1499, 5, 862, 1353, 114, 9, 142, 15, 47, 460, 77, 224, 18, 2109, 21, 152, 97, 101, 281, 11, 6, 1985, 20, 10, 10, 4129, 4, 1985, 1352, 26, 4, 2, 25, 28, 126, 110, 1814, 11, 4, 1985, 20, 970, 3882, 8, 124, 15, 4, 1985, 1352, 5, 2, 26, 142, 4, 451, 2, 2, 246, 49, 7, 134, 2, 26, 43, 1044, 2968, 10, 10, 50, 26, 6, 378, 7, 1076, 52, 1801, 13, 165, 179, 423, 4, 603, 409, 28, 1046, 2, 2, 2, 5, 10, 10, 1361, 48, 141, 6, 155, 70, 1778, 10, 10, 13, 82, 179, 423, 4, 1347, 18, 2, 4, 2, 2, 50, 26, 38, 111, 189, 102, 15, 2, 23, 105, 2, 2, 21, 11, 14, 420, 36, 86, 2, 6, 55, 2, 5, 1134, 1210, 1985, 2, 5, 140, 4682, 4, 1939, 13, 384, 25, 70, 516, 2, 19, 3390, 3589, 5, 75, 28, 49, 184, 976, 2, 134, 504, 1616, 30, 99, 254, 8, 276, 107, 5, 107, 295, 2, 21, 11, 801, 405, 14, 20, 271, 120, 4, 350, 5, 1608, 49, 85, 55, 2, 5, 1139, 1210, 2, 2872]
len: 323
```

```python
review_as_text = ' '.join(id_to_word[id] for id in X_train[26])
print(review_as_text)
print('len:', len(review_as_text))
```

```
ok as everyone has pointed out this film is a complete dog to some degree this is because it was a gory film that had a lot of material or down to near to escape the x rating but the film has many other flaws as well br br to begin with the seems to have got his and vampires mixed up the in this film are and don't like silver but in every other respect they behave like vampires now you just can't do that with a crappy genre flick you've got to stick to the rules of the genre or the fans get all confused and annoyed by disbelief in the wrong thing in fact the whole confusing and poorly presented plot is something that has already been done for vampires but doesn't make any sense in a werewolf movie br br secondly the werewolf costumes are the you have ever seen anybody in the werewolf movie business ought to know that the werewolf costumes and are something the fans yet some of these are just plain goofy br br there are a couple of slightly good bits i actually quite liked the score others have mentioned and br br spoiler if such a thing can exist br br i also quite liked the plan for the there are so many horror movies that on characters but in this case they first a very and effective anti werewolf and go slaughter the monsters i mean you can kill with silver bullets and we have some pretty powerful these days shouldn't be too hard to put two and two together but in typical style this movie goes over the top and adds some other very and amusing anti weapons
len: 1682
```
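The class label and class name for this review can be printed in the same way (a minimal sketch using the already loaded `y_train`; its output is not recorded here):

```python
# print the class label of the selected review and its class name
# (1 corresponds to Positive, 0 to Negative, as stated in the task)
class_names = {0: 'Negative', 1: 'Positive'}
label = int(y_train[26])
print('Label:', label, '-', class_names[label])
```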
## 4) We printed the maximum and minimum review length in the training set.

```python
print('MAX Len: ', len(max(X_train, key=len)))
print('MIN Len: ', len(min(X_train, key=len)))
```

```
MAX Len: 2494
MIN Len: 11
```

## 5) We preprocessed the data: we chose a single length to which all reviews are brought, padded short reviews with the special padding token and truncated long ones to the chosen length.

```python
# data preprocessing
from tensorflow.keras.utils import pad_sequences

max_words = 500

X_train = pad_sequences(X_train, maxlen=max_words, value=0, padding='pre', truncating='post')
X_test = pad_sequences(X_test, maxlen=max_words, value=0, padding='pre', truncating='post')
```

## 6) We repeated step 4.

```python
print('MAX Len: ', len(max(X_train, key=len)))
print('MIN Len: ', len(min(X_train, key=len)))
```

```
MAX Len: 500
MIN Len: 500
```

## 7) We repeated step 3 and drew a conclusion about how the review was transformed by the preprocessing.

```python
print(X_train[26])
print('len:', len(X_train[26]))
```

```
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 608 17
 316 47 3381 46 14 22 9 6 601 912 8 49 2461 14 9 88 12 16 6 2207
 2 22 15 69 6 176 7 819 2 42 2 180 8 751 2 8 1090 4 2 1745
 675 21 4 22 47 111 85 1508 17 73 10 10 8 895 19 4 2 186 8 28
 188 27 2 5 2109 1849 56 4 2 11 14 22 26 2 5 92 40 3390 21 11
 175 85 1161 36 4486 40 2109 150 25 43 191 81 15 19 6 2136 512 509 874 188
 8 1231 8 4 2269 7 4 512 42 4 451 79 32 1471 5 3222 34 2 2793 11
 4 355 155 11 192 4 226 1499 5 862 1353 114 9 142 15 47 460 77 224 18
 2109 21 152 97 101 281 11 6 1985 20 10 10 4129 4 1985 1352 26 4 2 25
 28 126 110 1814 11 4 1985 20 970 3882 8 124 15 4 1985 1352 5 2 26 142
 4 451 2 2 246 49 7 134 2 26 43 1044 2968 10 10 50 26 6 378 7
 1076 52 1801 13 165 179 423 4 603 409 28 1046 2 2 2 5 10 10 1361 48
 141 6 155 70 1778 10 10 13 82 179 423 4 1347 18 2 4 2 2 50 26
 38 111 189 102 15 2 23 105 2 2 21 11 14 420 36 86 2 6 55 2
 5 1134 1210 1985 2 5 140 4682 4 1939 13 384 25 70 516 2 19 3390 3589 5
 75 28 49 184 976 2 134 504 1616 30 99 254 8 276 107 5 107 295 2 21
 11 801 405 14 20 271 120 4 350 5 1608 49 85 55 2 5 1139 1210 2 2872]
len: 500
```

```python
review_as_text = ' '.join(id_to_word[id] for id in X_train[26])
print(review_as_text)
print('len:', len(review_as_text))
```

```
ok as everyone has pointed out this film is a complete dog to some degree this is because it was a gory film that had a lot of material or down to near to escape the x rating but the film has many other flaws as well br br to begin with the seems to have got his and vampires mixed up the in this film are and don't like silver but in every other respect they behave like vampires now you just can't do that with a crappy genre flick you've got to stick to the rules of the genre or the fans get all confused and annoyed by disbelief in the wrong thing in fact the whole confusing and poorly presented plot is something that has already been done for vampires but doesn't make any sense in a werewolf movie br br secondly the werewolf costumes are the you have ever seen anybody in the werewolf movie business ought to know that the werewolf costumes and are something the fans yet some of these are just plain goofy br br there are a couple of slightly good bits i actually quite liked the score others have mentioned and br br spoiler if such a thing can exist br br i also quite liked the plan for the there are so many horror movies that on characters but in this case they first a very and effective anti werewolf and go slaughter the monsters i mean you can kill with silver bullets and we have some pretty powerful these days shouldn't be too hard to put two and two together but in typical style this movie goes over the top and adds some other very and amusing anti weapons
len: 2744
```

After preprocessing, the required number of padding tokens `<PAD>` (index 0) was added to the beginning of the review so that the review is 500 indices long.
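This can be checked directly (a minimal sketch): since the original review had 323 indices, 500 - 323 = 177 padding indices should have been prepended.

```python
# count the padding indices added to the front of review 26
n_pad = int(np.sum(X_train[26] == 0))
print('Padding tokens:', n_pad)               # expected: 500 - 323 = 177
print('Original length:', max_words - n_pad)  # expected: 323
```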
## 8) We printed the preprocessed training and test data arrays and their shapes.

```python
# print the data
print('X train: \n', X_train)
print('X test: \n', X_test)

# print the shapes
print('Shape of X train:', X_train.shape)
print('Shape of X test:', X_test.shape)
```

```
X train: 
 [[   0    0    0 ...    4   86  273]
 [   0    0    0 ...  705    9  150]
 [   0    0    0 ...   44   12   32]
 ...
 [   0    0    0 ...   22    8  377]
 [   0    0    0 ...    4 2554  647]
 [   0    0    0 ...    2    4    2]]
X test: 
 [[   0    0    0 ...  106   14   31]
 [   0    0    0 ...  458  168   52]
 [   0    0    0 ...   22    6   31]
 ...
 [   0    0    0 ...   38   76  128]
 [   0    0    0 ...   73  290   12]
 [   0    0    0 ...   12   38   76]]
Shape of X train: (25000, 500)
Shape of X test: (25000, 500)
```

## 9) We implemented a recurrent neural network model consisting of Embedding, LSTM, Dropout and Dense layers and trained it on the training data, holding out part of the training data for validation. We printed information about the network architecture and achieved a training quality of at least 0.8 by the accuracy metric.

```python
embed_dim = 32
lstm_units = 64

model = Sequential()
model.add(layers.Embedding(input_dim=vocabulary_size, output_dim=embed_dim,
                           input_length=max_words, input_shape=(max_words,)))
model.add(layers.LSTM(lstm_units))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))

model.summary()
```

![alt text](1_p9.png)
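The trainable-parameter count of this architecture can be derived by hand as a sanity check (a worked calculation using the standard parameter formulas for these layer types and the sizes defined above); it matches the 184 897 parameters listed in Table 1 below:

```python
# Embedding: one vector of size embed_dim per vocabulary entry
embedding_params = vocabulary_size * embed_dim                            # 5000 * 32 = 160000
# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)    # 4 * (64*96 + 64) = 24832
# Dense: one weight per LSTM unit plus a bias (Dropout has no parameters)
dense_params = lstm_units * 1 + 1                                         # 65
print(embedding_params + lstm_params + dense_params)                      # 184897
```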
```python
# compile and train the model
batch_size = 64
epochs = 3

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.2)
```

```
Epoch 1/3
313/313 ━━━━━━━━━━━━━━━━━━━━ 12s 23ms/step - accuracy: 0.6328 - loss: 0.6075 - val_accuracy: 0.7826 - val_loss: 0.4588
Epoch 2/3
313/313 ━━━━━━━━━━━━━━━━━━━━ 6s 21ms/step - accuracy: 0.8121 - loss: 0.4143 - val_accuracy: 0.8628 - val_loss: 0.3359
Epoch 3/3
313/313 ━━━━━━━━━━━━━━━━━━━━ 10s 21ms/step - accuracy: 0.8905 - loss: 0.2795 - val_accuracy: 0.8506 - val_loss: 0.3324
```

```python
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"\nTest accuracy: {test_acc}")
print(f"\nTest loss: {test_loss}")
```

```
accuracy: 0.8544 loss: 0.3396

Test accuracy: 0.8564800024032593

Test loss: 0.33131280541419983
```

## 10) We evaluated the training quality on the test data:

* printed the value of the classification quality metric on the test data;
* printed the classification report for the test set;
* plotted the ROC curve from the test-set predictions and computed the area under the ROC curve (AUC ROC).

```python
# classification quality metric on the test data
print(f"\nTest accuracy: {test_acc}")
```

```
Test accuracy: 0.8564800024032593
```

```python
# classification report for the test set
y_score = model.predict(X_test)
y_pred = [1 if y_score[i, 0] >= 0.5 else 0 for i in range(len(y_score))]

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, labels=[0, 1], target_names=['Negative', 'Positive']))
```

```
              precision    recall  f1-score   support

    Negative       0.83      0.90      0.86     12500
    Positive       0.89      0.82      0.85     12500

    accuracy                           0.86     25000
   macro avg       0.86      0.86      0.86     25000
weighted avg       0.86      0.86      0.86     25000
```

```python
# ROC curve and AUC ROC
from sklearn.metrics import roc_curve, auc

fpr, tpr, thresholds = roc_curve(y_test, y_score)

plt.plot(fpr, tpr)
plt.grid()
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.show()

print('AUC ROC:', auc(fpr, tpr))
```

![alt text](1_p10.png)

```
AUC ROC: 0.9356664896
```

## 11) We drew conclusions about the results of applying a recurrent neural network to the task of text sentiment classification.

Table 1:

| Model | Number of trainable parameters | Number of training epochs | Test-set classification quality |
|-----------|-------------------------------------|---------------------------|-----------------------------------------|
| Recurrent | 184 897 | 3 | accuracy: 0.8564; loss: 0.3313; AUC ROC: 0.9357 |

## Conclusion

From the results of applying the recurrent neural network, we can conclude that the model handled the sentiment classification task reasonably well. The accuracy of 0.8564 exceeds the required threshold of 0.8, and the AUC ROC of 0.9357 (> 0.9) indicates that the model distinguishes the two classes, positive and negative reviews, well.