1) Created a new notebook in Google Colab. Imported the libraries and modules required for the work. Configured the notebook to use a GPU hardware accelerator.
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
Found GPU at: /device:GPU:0
2) Loaded the IMDb dataset, which contains digitized movie reviews labeled into two classes: positive and negative. When loading the dataset, the seed parameter was set to (4k – 1) = 39, where k = 10 is the team number. Printed the shapes of the resulting training and test arrays.
# load the dataset
from keras.datasets import imdb
vocabulary_size = 5000
index_from = 3
(X_train, y_train), (X_test, y_test) = imdb.load_data(
    path="imdb.npz",
    num_words=vocabulary_size,
    skip_top=0,
    maxlen=None,
    seed=39,
    start_char=1,
    oov_char=2,
    index_from=index_from
)
# print the shapes
print('Shape of X train:', X_train.shape)
print('Shape of y train:', y_train.shape)
print('Shape of X test:', X_test.shape)
print('Shape of y test:', y_test.shape)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step
Shape of X train: (25000,)
Shape of y train: (25000,)
Shape of X test: (25000,)
Shape of y test: (25000,)
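Before training it is worth confirming that the two classes are balanced. A minimal sketch on a toy label list (the real y_train array can be counted the same way):

```python
from collections import Counter

# Toy stand-in for y_train; with the real IMDb labels each class has 12500 items.
labels = [0, 1, 1, 0, 1, 0, 0, 1]
counts = Counter(labels)
print('Negative:', counts[0], 'Positive:', counts[1])  # Negative: 4 Positive: 4
```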
3) Printed one review from the training set as a list of word indices. Converted the list of indices to text and printed the review as text. Printed the review's length, its class label, and the class name (1 – Positive, 0 – Negative).
# build a dictionary for converting indices to words
# load the "word:index" dictionary
word_to_id = imdb.get_word_index()
# shift the dictionary to account for the index offset
word_to_id = {key: (value + index_from) for key, value in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
word_to_id["<UNUSED>"] = 3
# build the reverse "index:word" dictionary
id_to_word = {value: key for key, value in word_to_id.items()}
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
1641221/1641221 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step
print(X_train[39])
print('len:',len(X_train[39]))
[1, 3206, 2, 3413, 3852, 2, 2, 73, 256, 19, 4396, 3033, 34, 488, 2, 47, 2993, 4058, 11, 63, 29, 4653, 1496, 27, 4122, 54, 4, 1334, 1914, 380, 1587, 56, 351, 18, 147, 2, 2, 15, 29, 238, 30, 4, 455, 564, 167, 1024, 2, 2, 2, 4, 2, 65, 33, 6, 2, 1062, 3861, 6, 3793, 1166, 7, 1074, 1545, 6, 171, 2, 1134, 388, 7, 3569, 2, 567, 31, 255, 37, 47, 6, 3161, 1244, 3119, 19, 6, 2, 11, 12, 2611, 120, 41, 419, 2, 17, 4, 3777, 2, 4952, 2468, 1457, 6, 2434, 4268, 23, 4, 1780, 1309, 5, 1728, 283, 8, 113, 105, 1037, 2, 285, 11, 6, 4800, 2905, 182, 5, 2, 183, 125, 19, 6, 327, 2, 7, 2, 668, 1006, 4, 478, 116, 39, 35, 321, 177, 1525, 2294, 6, 226, 176, 2, 2, 17, 2, 1220, 119, 602, 2, 2, 592, 2, 17, 2, 2, 1405, 2, 597, 503, 1468, 2, 2, 17, 2, 1947, 3702, 884, 1265, 3378, 1561, 2, 17, 2, 2, 992, 3217, 2393, 4923, 2, 17, 2, 2, 1255, 2, 2, 2, 117, 17, 6, 254, 2, 568, 2297, 5, 2, 2, 17, 1047, 2, 2186, 2, 1479, 488, 2, 4906, 627, 166, 1159, 2552, 361, 7, 2877, 2, 2, 665, 718, 2, 2, 2, 603, 4716, 127, 4, 2873, 2, 56, 11, 646, 227, 531, 26, 670, 2, 17, 6, 2, 2, 3510, 2, 17, 6, 2, 2, 2, 3014, 17, 6, 2, 668, 2, 503, 1468, 2, 19, 11, 4, 1746, 5, 2, 4778, 11, 31, 7, 41, 1273, 154, 255, 555, 6, 1156, 5, 737, 431]
len: 274
review_as_text = ' '.join(id_to_word[id] for id in X_train[39])
print(review_as_text)
print('len:',len(review_as_text))
<START> troubled <UNK> magazine photographer <UNK> <UNK> well played with considerable intensity by michael <UNK> has horrific nightmares in which he brutally murders his models when the lovely ladies start turning up dead for real <UNK> <UNK> that he might be the killer writer director william <UNK> <UNK> <UNK> the <UNK> story at a <UNK> pace builds a reasonable amount of tension delivers a few <UNK> effective moments of savage <UNK> violence one woman who has a plastic garbage bag with a <UNK> in it placed over her head <UNK> as the definite <UNK> inducing highlight puts a refreshing emphasis on the nicely drawn and engaging true to life characters further <UNK> everything in a plausible everyday world and <UNK> things off with a nice <UNK> of <UNK> female nudity the fine acting from an excellent cast helps matters a whole lot <UNK> <UNK> as <UNK> charming love interest <UNK> <UNK> james <UNK> as <UNK> <UNK> double <UNK> brother b j <UNK> <UNK> as <UNK> concerned psychiatrist dr frank curtis don <UNK> as <UNK> <UNK> gay assistant louis pamela <UNK> as <UNK> <UNK> detective <UNK> <UNK> <UNK> little as a hard <UNK> police chief and <UNK> <UNK> as sweet <UNK> model <UNK> r michael <UNK> polished cinematography makes impressive occasional use of breathtaking <UNK> <UNK> shots jack <UNK> <UNK> <UNK> score likewise does the trick <UNK> up in cool bit parts are robert <UNK> as a <UNK> <UNK> sally <UNK> as a <UNK> <UNK> <UNK> shower as a <UNK> female <UNK> b j <UNK> with in the ring and <UNK> bay in one of her standard old woman roles a solid and enjoyable picture
len: 1584
4) Printed the maximum and minimum review lengths in the training set.
print('MAX Len: ',len(max(X_train, key=len)))
print('MIN Len: ',len(min(X_train, key=len)))
MAX Len: 2494
MIN Len: 11
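The spread between 11 and 2494 tokens motivates picking a single fixed length in the next step. A minimal sketch of the same length statistics on a hypothetical list of reviews (the real X_train can be passed in the same way):

```python
# Toy stand-in for X_train: lists of word indices of varying length.
reviews = [[1] * n for n in (11, 120, 274, 500, 2494)]
lengths = sorted(len(r) for r in reviews)
print('min:', lengths[0], 'max:', lengths[-1], 'median:', lengths[len(lengths) // 2])
# min: 11 max: 2494 median: 274
```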
5) Preprocessed the data. Chose a single length to which all reviews are brought: short reviews were padded with the special symbol, and long ones were truncated to the chosen length.
# data preprocessing
from tensorflow.keras.utils import pad_sequences
max_words = 500
X_train = pad_sequences(X_train, maxlen=max_words, value=0, padding='pre', truncating='post')
X_test = pad_sequences(X_test, maxlen=max_words, value=0, padding='pre', truncating='post')
6) Repeated step 4.
print('MAX Len: ',len(max(X_train, key=len)))
print('MIN Len: ',len(min(X_train, key=len)))
MAX Len: 500
MIN Len: 500
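The combination padding='pre', truncating='post' means a long review keeps its first max_words tokens while a short one is left-padded with zeros. A minimal pure-Python sketch of this behaviour (the helper name is ours, not a Keras API):

```python
def pad_pre_truncate_post(seqs, maxlen, value=0):
    """Mimic pad_sequences(padding='pre', truncating='post'): long sequences
    keep their first `maxlen` items, short ones are left-padded with `value`."""
    out = []
    for s in seqs:
        s = s[:maxlen]                               # truncating='post'
        out.append([value] * (maxlen - len(s)) + s)  # padding='pre'
    return out

print(pad_pre_truncate_post([[1, 2, 3], [4, 5, 6, 7, 8]], 4))
# [[0, 1, 2, 3], [4, 5, 6, 7]]
```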
7) Repeated step 3 and drew a conclusion about how the review changed after preprocessing.
print(X_train[39])
print('len:',len(X_train[39]))
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 3206 2 3413 3852 2 2 73 256 19 4396 3033
34 488 2 47 2993 4058 11 63 29 4653 1496 27 4122 54
4 1334 1914 380 1587 56 351 18 147 2 2 15 29 238
30 4 455 564 167 1024 2 2 2 4 2 65 33 6
2 1062 3861 6 3793 1166 7 1074 1545 6 171 2 1134 388
7 3569 2 567 31 255 37 47 6 3161 1244 3119 19 6
2 11 12 2611 120 41 419 2 17 4 3777 2 4952 2468
1457 6 2434 4268 23 4 1780 1309 5 1728 283 8 113 105
1037 2 285 11 6 4800 2905 182 5 2 183 125 19 6
327 2 7 2 668 1006 4 478 116 39 35 321 177 1525
2294 6 226 176 2 2 17 2 1220 119 602 2 2 592
2 17 2 2 1405 2 597 503 1468 2 2 17 2 1947
3702 884 1265 3378 1561 2 17 2 2 992 3217 2393 4923 2
17 2 2 1255 2 2 2 117 17 6 254 2 568 2297
5 2 2 17 1047 2 2186 2 1479 488 2 4906 627 166
1159 2552 361 7 2877 2 2 665 718 2 2 2 603 4716
127 4 2873 2 56 11 646 227 531 26 670 2 17 6
2 2 3510 2 17 6 2 2 2 3014 17 6 2 668
2 503 1468 2 19 11 4 1746 5 2 4778 11 31 7
41 1273 154 255 555 6 1156 5 737 431]
len: 500
review_as_text = ' '.join(id_to_word[id] for id in X_train[39])
print(review_as_text)
print('len:',len(review_as_text))
<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <START> troubled <UNK> magazine photographer <UNK> <UNK> well played with considerable intensity by michael <UNK> has horrific nightmares in which he brutally murders his models when the lovely ladies start turning up dead for real <UNK> <UNK> that he might be the killer writer director william <UNK> <UNK> <UNK> the <UNK> story at a <UNK> pace builds a reasonable amount of tension delivers a few <UNK> effective moments of savage <UNK> violence one woman who has a plastic garbage bag with a <UNK> in it placed over her head <UNK> as the definite <UNK> inducing highlight puts a refreshing emphasis on the 
nicely drawn and engaging true to life characters further <UNK> everything in a plausible everyday world and <UNK> things off with a nice <UNK> of <UNK> female nudity the fine acting from an excellent cast helps matters a whole lot <UNK> <UNK> as <UNK> charming love interest <UNK> <UNK> james <UNK> as <UNK> <UNK> double <UNK> brother b j <UNK> <UNK> as <UNK> concerned psychiatrist dr frank curtis don <UNK> as <UNK> <UNK> gay assistant louis pamela <UNK> as <UNK> <UNK> detective <UNK> <UNK> <UNK> little as a hard <UNK> police chief and <UNK> <UNK> as sweet <UNK> model <UNK> r michael <UNK> polished cinematography makes impressive occasional use of breathtaking <UNK> <UNK> shots jack <UNK> <UNK> <UNK> score likewise does the trick <UNK> up in cool bit parts are robert <UNK> as a <UNK> <UNK> sally <UNK> as a <UNK> <UNK> <UNK> shower as a <UNK> female <UNK> b j <UNK> with in the ring and <UNK> bay in one of her standard old woman roles a solid and enjoyable picture
len: 2940
8) Printed the preprocessed training and test arrays and their shapes.
# print the arrays
print('X train: \n', X_train)
print('X test: \n', X_test)
# print the shapes
print('Shape of X train:', X_train.shape)
print('Shape of X test:', X_test.shape)
X train:
[[ 0 0 0 ... 7 4 2407]
[ 0 0 0 ... 34 705 2]
[ 0 0 0 ... 2222 8 369]
...
[ 0 0 0 ... 11 4 4596]
[ 0 0 0 ... 574 42 24]
[ 0 0 0 ... 7 13 3891]]
X test:
[[ 0 0 0 ... 6 52 20]
[ 0 0 0 ... 62 30 821]
[ 0 0 0 ... 24 3081 25]
...
[ 0 0 0 ... 19 666 3159]
[ 0 0 0 ... 7 15 1716]
[ 0 0 0 ... 1194 61 113]]
Shape of X train: (25000, 500)
Shape of X test: (25000, 500)
9) Implemented a recurrent neural network model consisting of Embedding, LSTM, Dropout, and Dense layers and trained it on the training data, holding out part of the training data for validation. Printed the network architecture. Achieved a training quality of at least 0.8 by the accuracy metric.
embed_dim = 32
lstm_units = 64
model = Sequential()
model.add(layers.Input(shape=(max_words,)))
model.add(layers.Embedding(input_dim=vocabulary_size, output_dim=embed_dim))
model.add(layers.LSTM(lstm_units))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ embedding (Embedding)           │ (None, 500, 32)        │       160,000 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ lstm (LSTM)                     │ (None, 64)             │        24,832 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout (Dropout)               │ (None, 64)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense (Dense)                   │ (None, 1)              │            65 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 184,897 (722.25 KB)
Trainable params: 184,897 (722.25 KB)
Non-trainable params: 0 (0.00 B)
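The parameter counts in the summary can be verified by hand; a pure-arithmetic sketch using the layer sizes defined above (an LSTM has four gates, each with input, recurrent, and bias weights):

```python
vocabulary_size, embed_dim, lstm_units = 5000, 32, 64

embedding_params = vocabulary_size * embed_dim                 # 5000 * 32 = 160000
lstm_params = 4 * ((embed_dim + lstm_units + 1) * lstm_units)  # 4 * 97 * 64 = 24832
dense_params = lstm_units * 1 + 1                              # weights + bias = 65

print(embedding_params + lstm_params + dense_params)  # 184897
```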
# compile and train the model
batch_size = 64
epochs = 3
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.2)
Epoch 1/3
313/313 ━━━━━━━━━━━━━━━━━━━━ 11s 23ms/step - accuracy: 0.6315 - loss: 0.6268 - val_accuracy: 0.8072 - val_loss: 0.4273
Epoch 2/3
313/313 ━━━━━━━━━━━━━━━━━━━━ 6s 20ms/step - accuracy: 0.8559 - loss: 0.3469 - val_accuracy: 0.8496 - val_loss: 0.3603
Epoch 3/3
313/313 ━━━━━━━━━━━━━━━━━━━━ 7s 21ms/step - accuracy: 0.8993 - loss: 0.2662 - val_accuracy: 0.8666 - val_loss: 0.3242
<keras.src.callbacks.history.History at 0x7ee57a9e6d20>
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"\nTest accuracy: {test_acc}")
782/782 ━━━━━━━━━━━━━━━━━━━━ 7s 9ms/step - accuracy: 0.8714 - loss: 0.3110
Test accuracy: 0.8674799799919128
10) Evaluated the quality of training on the test data:
- printed the value of the classification quality metric on the test data
- printed the classification report for the test set
- plotted the ROC curve from the test-set predictions and computed the area under it (ROC AUC)
# classification quality metric on the test data
print(f"\nTest accuracy: {test_acc}")
Test accuracy: 0.8674799799919128
# classification report for the test set
y_score = model.predict(X_test)
y_pred = [1 if y_score[i,0]>=0.5 else 0 for i in range(len(y_score))]
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, labels=[0, 1], target_names=['Negative', 'Positive']))
782/782 ━━━━━━━━━━━━━━━━━━━━ 6s 7ms/step
              precision    recall  f1-score   support

    Negative       0.86      0.88      0.87     12500
    Positive       0.87      0.86      0.87     12500

    accuracy                           0.87     25000
   macro avg       0.87      0.87      0.87     25000
weighted avg       0.87      0.87      0.87     25000
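The per-class figures in the report follow directly from true/false positive and false negative counts; a toy illustration with hypothetical counts (not taken from this model's predictions):

```python
# Hypothetical counts for one class, chosen only to illustrate the formulas.
tp, fp, fn = 8, 2, 4
precision = tp / (tp + fp)                          # 8 / 10 = 0.8
recall = tp / (tp + fn)                             # 8 / 12 ≈ 0.667
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ≈ 0.727
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727
```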
# ROC curve and ROC AUC
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_score[:, 0])
plt.plot(fpr, tpr)
plt.grid()
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.show()
print('AUC ROC:', auc(fpr, tpr))
AUC ROC: 0.9387573504
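auc(fpr, tpr) is a trapezoidal integral of the ROC curve; a minimal sketch of the same computation on a toy curve (the points are invented, not this model's):

```python
# Toy ROC points with fpr in ascending order.
fpr = [0.0, 0.0, 0.5, 1.0]
tpr = [0.0, 0.5, 1.0, 1.0]

# Trapezoidal rule: sum of segment width times average height.
auc_val = sum((fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2
              for i in range(len(fpr) - 1))
print(auc_val)  # 0.875
```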