{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "gz18QPRz03Ec"
},
"source": [
"### 1) Created a new notebook in Google Colab. Imported the libraries and modules required for the work. Configured the notebook to use a GPU hardware accelerator."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "mr9IszuQ1ANG"
},
"outputs": [],
"source": [
"# import modules\n",
"import os\n",
"\n",
"from tensorflow import keras\n",
"from tensorflow.keras import layers\n",
"from tensorflow.keras.models import Sequential\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FFRtE0TN1AiA"
},
"source": [
"### 2) Loaded the IMDb dataset, which contains digitized movie reviews labeled with two classes: positive and negative. When loading the dataset, set the seed parameter to (4k - 1) = 31, where k = 8 is the team number. Printed the shapes of the resulting training and test arrays."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "Ixw5Sp0_1A-w"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[1m17464789/17464789\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m7s\u001b[0m 0us/step\n",
"Shape of X train: (25000,)\n",
"Shape of y train: (25000,)\n",
"Shape of X test: (25000,)\n",
"Shape of y test: (25000,)\n"
]
}
],
"source": [
"# load the dataset\n",
"from keras.datasets import imdb\n",
"\n",
"vocabulary_size = 5000\n",
"index_from = 3\n",
"\n",
"(X_train, y_train), (X_test, y_test) = imdb.load_data(\n",
" path=\"imdb.npz\",\n",
" num_words=vocabulary_size,\n",
" skip_top=0,\n",
" maxlen=None,\n",
" seed=31,\n",
" start_char=1,\n",
" oov_char=2,\n",
" index_from=index_from\n",
" )\n",
"\n",
"# print the shapes\n",
"print('Shape of X train:', X_train.shape)\n",
"print('Shape of y train:', y_train.shape)\n",
"print('Shape of X test:', X_test.shape)\n",
"print('Shape of y test:', y_test.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "aCo_lUXl1BPV"
},
"source": [
"### 3) Printed one review from the training set as a list of word indices. Converted the list of indices to text and printed the review as text. Printed the review length. Printed the class label of this review and the class name (1 = Positive, 0 = Negative)."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "9W3RklPcZyH0"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json\n",
"\u001b[1m1641221/1641221\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m3s\u001b[0m 2us/step\n"
]
}
],
"source": [
"# build a dictionary for converting indices to words\n",
"# load the \"word: index\" dictionary\n",
"word_to_id = imdb.get_word_index()\n",
"# shift the indices by index_from and add the special tokens\n",
"word_to_id = {key:(value + index_from) for key,value in word_to_id.items()}\n",
"word_to_id[\"<PAD>\"] = 0\n",
"word_to_id[\"<START>\"] = 1\n",
"word_to_id[\"<UNK>\"] = 2\n",
"word_to_id[\"<UNUSED>\"] = 3\n",
"# build the reverse \"index: word\" dictionary\n",
"id_to_word = {value:key for key,value in word_to_id.items()}"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "Nu-Bs1jnaYhB"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1, 13, 805, 8, 40, 14, 1179, 40, 13, 353, 8, 358, 32, 1179, 108, 13, 384, 3091, 2, 1849, 19, 6, 117, 1006, 5, 49, 836, 89, 70, 25, 140, 355, 21, 2, 13, 104, 9, 35, 463, 7, 15, 2063, 170, 355, 4, 293, 1834, 9, 4, 527, 116, 7, 4, 293, 289, 539, 15, 2, 56, 11, 4, 313, 12, 16, 17, 48, 36, 71, 467, 2, 5, 12, 2230, 72, 39, 126, 397, 928, 11, 68, 4598, 4, 22, 2, 18, 836, 5, 2, 21, 4, 34, 4, 1396, 458, 2, 12, 7, 148, 5, 889, 4, 20, 184, 753, 45, 6, 902, 88, 48, 4, 20, 16, 128, 2142, 12, 62, 28, 28, 77, 2, 4, 65, 5, 105, 26, 184, 948, 5, 50, 26, 49, 465, 5, 2, 1984, 388, 7, 4347, 200, 4, 452, 4, 539, 5, 4, 577, 11, 4, 154, 313, 225, 49, 52, 1006, 5, 2552, 2, 2, 43, 24, 195, 8, 202, 4, 22, 4, 1968, 12, 887, 4, 1962, 9, 184, 2509, 5, 2, 5, 127, 202, 4, 22, 6, 194, 2, 21, 1038, 94, 99, 117, 99, 522, 38, 11, 61, 652, 31, 8, 798, 894, 25, 66, 119, 3720, 1179, 108, 225, 6, 1257, 1166, 7, 986, 21, 4, 22, 1545, 99, 117, 8, 30, 2640]\n",
"len: 220\n"
]
}
],
"source": [
"print(X_train[26])\n",
"print('len:',len(X_train[26]))"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"id": "JhTwTurtZ6Sp"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<START> i tried to like this slasher like i try to enjoy all slasher films i mean mindless <UNK> mixed with a little nudity and some suspense how can you go wrong but <UNK> i think is an example of that formula going wrong the main issue is the horrible acting of the main three girls that <UNK> up in the house it was as if they were under <UNK> and it stopped me from ever getting interested in their plight the film <UNK> for suspense and <UNK> but the by the numbers direction <UNK> it of those and leaves the movie pretty dull it's a shame because if the movie was better executed it would have have been <UNK> the story and characters are pretty creepy and there are some dark and <UNK> humorous moments of interaction between the mother the girls and the daughter in the old house there's some good nudity and occasional <UNK> <UNK> just not enough to give the film the kick it needed the finale is pretty twisted and <UNK> and does give the film a big <UNK> but sadly its too little too late so in my opinion one to avoid unless you really love obscure slasher films there's a fair amount of potential but the film delivers too little to be worthwhile\n",
"len: 1159\n",
"Label: 0 ( Negative )\n"
]
}
],
"source": [
"review_as_text = ' '.join(id_to_word[idx] for idx in X_train[26])\n",
"print(review_as_text)\n",
"print('len:',len(review_as_text))\n",
"print('Label:', y_train[26], '(', 'Positive' if y_train[26] == 1 else 'Negative', ')')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4hclnNaD1BuB"
},
"source": [
"### 4) Printed the maximum and minimum review length in the training set."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"id": "xJH87ISq1B9h"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MAX Len: 2494\n",
"MIN Len: 11\n"
]
}
],
"source": [
"print('MAX Len: ',len(max(X_train, key=len)))\n",
"print('MIN Len: ',len(min(X_train, key=len)))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7x99O8ig1CLh"
},
"source": [
"### 5) Preprocessed the data. Chose a single length to which all reviews are brought. Short reviews were padded with a special token, and long ones were truncated to the chosen length."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"id": "lrF-B2aScR4t"
},
"outputs": [],
"source": [
"# preprocess the data: pad at the front, truncate at the end\n",
"from tensorflow.keras.utils import pad_sequences\n",
"max_words = 500\n",
"X_train = pad_sequences(X_train, maxlen=max_words, value=0, padding='pre', truncating='post')\n",
"X_test = pad_sequences(X_test, maxlen=max_words, value=0, padding='pre', truncating='post')"
]
},
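{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of what `pad_sequences` does with the settings used above (`padding='pre'`, `truncating='post'`), shown on made-up toy sequences rather than IMDb data. The names `toy_sequences` and `toy_padded` are illustrative assumptions and are not part of the original work; `maxlen=5` is chosen only to keep the output readable."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy illustration (hypothetical data, not taken from IMDb)\n",
"from tensorflow.keras.utils import pad_sequences\n",
"\n",
"toy_sequences = [[5, 6, 7], [1, 2, 3, 4, 5, 6, 7]]\n",
"toy_padded = pad_sequences(toy_sequences, maxlen=5, value=0, padding='pre', truncating='post')\n",
"\n",
"# the short sequence gets zeros at the front: [0 0 5 6 7]\n",
"# the long sequence loses its tail:          [1 2 3 4 5]\n",
"print(toy_padded)"
]
},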
{
"cell_type": "markdown",
"metadata": {
"id": "HL2_LVga1C3l"
},
"source": [
"### 6) Repeated step 4."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"id": "81Cgq8dn9uL6"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MAX Len: 500\n",
"MIN Len: 500\n"
]
}
],
"source": [
"print('MAX Len: ',len(max(X_train, key=len)))\n",
"print('MIN Len: ',len(min(X_train, key=len)))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KzrVY1SR1DZh"
},
"source": [
"### 7) Repeated step 3. Drew a conclusion about how the review changed after preprocessing."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"id": "vudlgqoCbjU1"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 1 13 805 8 40 14 1179 40 13 353 8 358 32 1179\n",
" 108 13 384 3091 2 1849 19 6 117 1006 5 49 836 89\n",
" 70 25 140 355 21 2 13 104 9 35 463 7 15 2063\n",
" 170 355 4 293 1834 9 4 527 116 7 4 293 289 539\n",
" 15 2 56 11 4 313 12 16 17 48 36 71 467 2\n",
" 5 12 2230 72 39 126 397 928 11 68 4598 4 22 2\n",
" 18 836 5 2 21 4 34 4 1396 458 2 12 7 148\n",
" 5 889 4 20 184 753 45 6 902 88 48 4 20 16\n",
" 128 2142 12 62 28 28 77 2 4 65 5 105 26 184\n",
" 948 5 50 26 49 465 5 2 1984 388 7 4347 200 4\n",
" 452 4 539 5 4 577 11 4 154 313 225 49 52 1006\n",
" 5 2552 2 2 43 24 195 8 202 4 22 4 1968 12\n",
" 887 4 1962 9 184 2509 5 2 5 127 202 4 22 6\n",
" 194 2 21 1038 94 99 117 99 522 38 11 61 652 31\n",
" 8 798 894 25 66 119 3720 1179 108 225 6 1257 1166 7\n",
" 986 21 4 22 1545 99 117 8 30 2640]\n",
"len: 500\n"
]
}
],
"source": [
"print(X_train[26])\n",
"print('len:',len(X_train[26]))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"id": "dbfkWjDI1Dp7"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
i tried to like this slasher like i try to enjoy all slasher films i mean mindless <UNK> mixed with a little nudity and some suspense how can you go wrong but <UNK> i think is an example of that formula going wrong the main issue is the horrible acting of the main three girls that <UNK> up in the house it was as if they were under <UNK> and it stopped me from ever getting interested in their plight the film <UNK> for suspense and <UNK> but the by the numbers direction <UNK> it of those and leaves the movie pretty dull it's a shame because if the movie was better executed it would have have been <UNK> the story and characters are pretty creepy and there are some dark and <UNK> humorous moments of interaction between the mother the girls and the daughter in the old house there's some good nudity and occasional <UNK> <UNK> just not enough to give the film the kick it needed the finale is pretty twisted and <UNK> and does give the film a big <UNK> but sadly its too little too late so in my opinion one to avoid unless you really love obscure slasher films there's a fair amount of potential but the film delivers too little to be worthwhile\n",
"len: 2839\n"
]
}
],
"source": [
"review_as_text = ' '.join(id_to_word[idx] for idx in X_train[26])\n",
"print(review_as_text)\n",
"print('len:',len(review_as_text))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "mJNRXo5TdPAE"
},
"source": [
"#### After preprocessing, <PAD> tokens were added to the beginning of the review to bring it to 500 indices: the review was originally 220 indices long, so 280 <PAD> tokens were prepended."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YgiVGr5_1D3u"
},
"source": [
"### 8) Printed the preprocessed training and test arrays and their shapes."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"id": "7MqcG_wl1EHI"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"X train: \n",
" [[ 0 0 0 ... 2 4050 2]\n",
" [ 0 0 0 ... 721 90 180]\n",
" [ 0 0 0 ... 1114 2 174]\n",
" ...\n",
" [ 1 1065 2022 ... 7 1514 2]\n",
" [ 0 0 0 ... 6 879 132]\n",
" [ 0 0 0 ... 12 152 157]]\n",
"X train: \n",
" [[ 0 0 0 ... 10 342 158]\n",
" [ 0 0 0 ... 2 67 12]\n",
" [ 0 0 0 ... 1242 1095 1095]\n",
" ...\n",
" [ 0 0 0 ... 4 2 136]\n",
" [ 0 0 0 ... 14 31 591]\n",
" [ 0 0 0 ... 7 3923 212]]\n",
"Shape of X train: (25000, 500)\n",
"Shape of X test: (25000, 500)\n"
]
}
],
"source": [
"# print the data\n",
"print('X train: \\n',X_train)\n",
"print('X test: \\n',X_test)\n",
"\n",
"# print the shapes\n",
"print('Shape of X train:', X_train.shape)\n",
"print('Shape of X test:', X_test.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "amaspXGW1EVy"
},
"source": [
"### 9) Implemented a recurrent neural network consisting of Embedding, LSTM, Dropout, and Dense layers and trained it on the training data, holding out part of the training data for validation. Printed the network architecture. Achieved an accuracy of at least 0.8."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"id": "ktWEeqWd1EyF"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\Admin\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\keras\\src\\layers\\core\\embedding.py:97: UserWarning: Argument `input_length` is deprecated. Just remove it.\n",
" warnings.warn(\n",
"c:\\Users\\Admin\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\keras\\src\\layers\\core\\embedding.py:100: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.\n",
" super().__init__(**kwargs)\n"
]
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\">Model: \"sequential\"</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1mModel: \"sequential\"\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓\n",
"┃<span style=\"font-weight: bold\"> Layer (type) </span>┃<span style=\"font-weight: bold\"> Output Shape </span>┃<span style=\"font-weight: bold\"> Param # </span>┃\n",
"┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩\n",
"│ embedding (<span style=\"color: #0087ff; text-decoration-color: #0087ff\">Embedding</span>) │ (<span style=\"color: #00d7ff; text-decoration-color: #00d7ff\">None</span>, <span style=\"color: #00af00; text-decoration-color: #00af00\">500</span>, <span style=\"color: #00af00; text-decoration-color: #00af00\">32</span>) │ <span style=\"color: #00af00; text-decoration-color: #00af00\">160,000</span> │\n",
"├─────────────────────────────────┼────────────────────────┼───────────────┤\n",
"│ lstm (<span style=\"color: #0087ff; text-decoration-color: #0087ff\">LSTM</span>) │ (<span style=\"color: #00d7ff; text-decoration-color: #00d7ff\">None</span>, <span style=\"color: #00af00; text-decoration-color: #00af00\">64</span>) │ <span style=\"color: #00af00; text-decoration-color: #00af00\">24,832</span> │\n",
"├─────────────────────────────────┼────────────────────────┼───────────────┤\n",
"│ dropout (<span style=\"color: #0087ff; text-decoration-color: #0087ff\">Dropout</span>) │ (<span style=\"color: #00d7ff; text-decoration-color: #00d7ff\">None</span>, <span style=\"color: #00af00; text-decoration-color: #00af00\">64</span>) │ <span style=\"color: #00af00; text-decoration-color: #00af00\">0</span> │\n",
"├─────────────────────────────────┼────────────────────────┼───────────────┤\n",
"│ dense (<span style=\"color: #0087ff; text-decoration-color: #0087ff\">Dense</span>) │ (<span style=\"color: #00d7ff; text-decoration-color: #00d7ff\">None</span>, <span style=\"color: #00af00; text-decoration-color: #00af00\">1</span>) │ <span style=\"color: #00af00; text-decoration-color: #00af00\">65</span> │\n",
"└─────────────────────────────────┴────────────────────────┴───────────────┘\n",
"</pre>\n"
],
"text/plain": [
"┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓\n",
"┃\u001b[1m \u001b[0m\u001b[1mLayer (type) \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mOutput Shape \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1m Param #\u001b[0m\u001b[1m \u001b[0m┃\n",
"┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩\n",
"│ embedding (\u001b[38;5;33mEmbedding\u001b[0m) │ (\u001b[38;5;45mNone\u001b[0m, \u001b[38;5;34m500\u001b[0m, \u001b[38;5;34m32\u001b[0m) │ \u001b[38;5;34m160,000\u001b[0m │\n",
"├─────────────────────────────────┼────────────────────────┼───────────────┤\n",
"│ lstm (\u001b[38;5;33mLSTM\u001b[0m) │ (\u001b[38;5;45mNone\u001b[0m, \u001b[38;5;34m64\u001b[0m) │ \u001b[38;5;34m24,832\u001b[0m │\n",
"├─────────────────────────────────┼────────────────────────┼───────────────┤\n",
"│ dropout (\u001b[38;5;33mDropout\u001b[0m) │ (\u001b[38;5;45mNone\u001b[0m, \u001b[38;5;34m64\u001b[0m) │ \u001b[38;5;34m0\u001b[0m │\n",
"├─────────────────────────────────┼────────────────────────┼───────────────┤\n",
"│ dense (\u001b[38;5;33mDense\u001b[0m) │ (\u001b[38;5;45mNone\u001b[0m, \u001b[38;5;34m1\u001b[0m) │ \u001b[38;5;34m65\u001b[0m │\n",
"└─────────────────────────────────┴────────────────────────┴───────────────┘\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\"> Total params: </span><span style=\"color: #00af00; text-decoration-color: #00af00\">184,897</span> (722.25 KB)\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m Total params: \u001b[0m\u001b[38;5;34m184,897\u001b[0m (722.25 KB)\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\"> Trainable params: </span><span style=\"color: #00af00; text-decoration-color: #00af00\">184,897</span> (722.25 KB)\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m Trainable params: \u001b[0m\u001b[38;5;34m184,897\u001b[0m (722.25 KB)\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\"> Non-trainable params: </span><span style=\"color: #00af00; text-decoration-color: #00af00\">0</span> (0.00 B)\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m Non-trainable params: \u001b[0m\u001b[38;5;34m0\u001b[0m (0.00 B)\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"embed_dim = 32\n",
"lstm_units = 64\n",
"\n",
"model = Sequential()\n",
"# declare the input shape explicitly instead of the deprecated input_length/input_shape arguments\n",
"model.add(keras.Input(shape=(max_words,)))\n",
"model.add(layers.Embedding(input_dim=vocabulary_size, output_dim=embed_dim))\n",
"model.add(layers.LSTM(lstm_units))\n",
"model.add(layers.Dropout(0.5))\n",
"model.add(layers.Dense(1, activation='sigmoid'))\n",
"\n",
"model.summary()"
]
},
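{
"cell_type": "markdown",
"metadata": {},
"source": [
"The parameter counts in the summary can be checked by hand. The sketch below assumes the standard formulas: an Embedding layer stores `vocabulary_size * embed_dim` weights, an LSTM layer has four gates with input weights, recurrent weights, and a bias each, and a Dense layer has one weight per input plus a bias. The variable names are illustrative, and the constants are copied from the cells above so that the cell runs on its own."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hand check of the parameter counts (constants copied from the notebook)\n",
"vocab, embed, units = 5000, 32, 64\n",
"\n",
"embedding_params = vocab * embed                       # one vector per word index\n",
"lstm_params = 4 * ((embed + units) * units + units)    # 4 gates: input + recurrent weights + bias\n",
"dense_params = units * 1 + 1                           # weights + bias of the single sigmoid unit\n",
"\n",
"print(embedding_params, lstm_params, dense_params)     # 160000 24832 65\n",
"print(embedding_params + lstm_params + dense_params)   # 184897"
]
},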
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"id": "CuPqKpX0kQfP"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 1/5\n",
"\u001b[1m313/313\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m61s\u001b[0m 184ms/step - accuracy: 0.8464 - loss: 0.3649 - val_accuracy: 0.8366 - val_loss: 0.3726\n",
"Epoch 2/5\n",
"\u001b[1m313/313\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m58s\u001b[0m 184ms/step - accuracy: 0.8838 - loss: 0.2931 - val_accuracy: 0.8692 - val_loss: 0.3221\n",
"Epoch 3/5\n",
"\u001b[1m313/313\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m59s\u001b[0m 188ms/step - accuracy: 0.9015 - loss: 0.2519 - val_accuracy: 0.8652 - val_loss: 0.3294\n",
"Epoch 4/5\n",
"\u001b[1m313/313\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m58s\u001b[0m 185ms/step - accuracy: 0.9151 - loss: 0.2225 - val_accuracy: 0.8636 - val_loss: 0.3255\n",
"Epoch 5/5\n",
"\u001b[1m313/313\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m82s\u001b[0m 184ms/step - accuracy: 0.9162 - loss: 0.2174 - val_accuracy: 0.8660 - val_loss: 0.3360\n"
]
},
{
"data": {
"text/plain": [
"<keras.src.callbacks.history.History at 0x219fd150250>"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# compile and train the model\n",
"batch_size = 64\n",
"epochs = 5\n",
"model.compile(loss=\"binary_crossentropy\", optimizer=\"adam\", metrics=[\"accuracy\"])\n",
"model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.2)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"id": "hJIWinxymQjb"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[1m782/782\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m38s\u001b[0m 49ms/step - accuracy: 0.8659 - loss: 0.3349\n",
"\n",
"Test accuracy: 0.865880012512207\n"
]
}
],
"source": [
"test_loss, test_acc = model.evaluate(X_test, y_test)\n",
"print(f\"\\nTest accuracy: {test_acc}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "mgrihPd61E8w"
},
"source": [
"### 10) Evaluated the trained model on the test data:\n",
"### - printed the classification quality metric on the test data\n",
"### - printed the classification report for the test set\n",
"### - plotted the ROC curve for the test set and computed the area under it (ROC AUC)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"id": "Rya5ABT8msha"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Test accuracy: 0.865880012512207\n"
]
}
],
"source": [
"# classification quality metric on the test data\n",
"print(f\"\\nTest accuracy: {test_acc}\")"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"id": "2kHjcmnCmv0Y"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[1m782/782\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m40s\u001b[0m 50ms/step\n",
" precision recall f1-score support\n",
"\n",
" Negative 0.91 0.82 0.86 12500\n",
" Positive 0.83 0.92 0.87 12500\n",
"\n",
" accuracy 0.87 25000\n",
" macro avg 0.87 0.87 0.87 25000\n",
"weighted avg 0.87 0.87 0.87 25000\n",
"\n"
]
}
],
"source": [
"# classification report for the test set\n",
"from sklearn.metrics import classification_report\n",
"\n",
"y_score = model.predict(X_test)\n",
"# threshold the sigmoid outputs at 0.5 to obtain class labels\n",
"y_pred = (y_score[:, 0] >= 0.5).astype(int)\n",
"\n",
"print(classification_report(y_test, y_pred, labels = [0, 1], target_names=['Negative', 'Positive']))"
]
},
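{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small consistency check on the report above, offered as a sketch rather than part of the original work: overall accuracy is the support-weighted mean of the per-class recalls, so with two classes of 12500 reviews each it is the plain average of the two recall values. The numbers below are the rounded values copied from the report, so the result is only accurate to two decimals."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Rounded per-class recall and support copied from the classification report\n",
"recall_negative, recall_positive = 0.82, 0.92\n",
"support_negative, support_positive = 12500, 12500\n",
"\n",
"# correctly classified reviews in each class, divided by the total number of reviews\n",
"correct = recall_negative * support_negative + recall_positive * support_positive\n",
"accuracy_from_recall = correct / (support_negative + support_positive)\n",
"print(round(accuracy_from_recall, 2))  # 0.87"
]
},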
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Kp4AQRbcmwAx"
},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"AUC ROC: 0.9420113727999999\n"
]
}
],
"source": [
"# plot the ROC curve and compute ROC AUC\n",
"from sklearn.metrics import roc_curve, auc\n",
"\n",
"fpr, tpr, thresholds = roc_curve(y_test, y_score)\n",
"plt.figure(figsize=(8, 6))\n",
"plt.plot(fpr, tpr)\n",
"plt.grid()\n",
"plt.xlabel('False Positive Rate')\n",
"plt.ylabel('True Positive Rate')\n",
"plt.title('ROC')\n",
"plt.savefig('roc_curve.png', dpi=300, bbox_inches='tight')\n",
"plt.show()\n",
"print('AUC ROC:', auc(fpr, tpr))"
]
},
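{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the ROC AUC value easier to interpret, here is a toy sketch with four made-up scores (not model outputs). It shows that `auc(fpr, tpr)` computed from `roc_curve` agrees with scikit-learn's `roc_auc_score`, and that the value equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The names `toy_labels` and `toy_scores` are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy illustration (hypothetical labels and scores)\n",
"from sklearn.metrics import roc_curve, auc, roc_auc_score\n",
"\n",
"toy_labels = [0, 0, 1, 1]\n",
"toy_scores = [0.1, 0.4, 0.35, 0.8]\n",
"\n",
"toy_fpr, toy_tpr, _ = roc_curve(toy_labels, toy_scores)\n",
"\n",
"# 3 of the 4 (positive, negative) pairs are ranked correctly, so both values are 0.75\n",
"print(auc(toy_fpr, toy_tpr))\n",
"print(roc_auc_score(toy_labels, toy_scores))"
]
},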
{
"cell_type": "markdown",
"metadata": {
"id": "MsM3ew3d1FYq"
},
"source": [
"### 11) Drew conclusions from applying a recurrent neural network to the text sentiment classification task."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xxFO4CXbIG88"
},
"source": [
"Table 1:"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xvoivjuNFlEf"
},
"source": [
"| Model | Number of trainable parameters | Number of training epochs | Classification quality on the test set |\n",
"|----------|-------------------------------------|---------------------------|-----------------------------------------|\n",
"| Recurrent | 184,897 | 5 | accuracy: 0.8659; loss: 0.3349; ROC AUC: 0.9420 |\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YctF8h_sIB-P"
},
"source": [
"#### Based on the results of applying the recurrent neural network and on the data in Table 1, the model handled the sentiment classification task well. The test accuracy of 0.8659 exceeds the required threshold of 0.8. The ROC AUC of 0.9420 (> 0.9) indicates that the model separates the two classes (positive and negative reviews) well. Precision and recall are also good: for negative reviews precision = 0.91 and recall = 0.82; for positive reviews precision = 0.83 and recall = 0.92."
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"gpuType": "T4",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}