{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Decomposition" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/topic-modeling-decomposition](https://github.com/huseinzol05/Malaya/tree/master/example/topic-modeling-decomposition).\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import malaya" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('tests/02032018.csv',sep=';')\n", "df = df.iloc[3:,1:]\n", "df.columns = ['text','label']\n", "corpus = df.text.tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can get this file at https://github.com/huseinzol05/malaya/blob/master/tests/02032018.csv. **This CSV is already stemmed.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load vectorizer object\n", "\n", "You can use `TfidfVectorizer`, `CountVectorizer`, or any other vectorizer, as long as it has a `fit_transform` method." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from malaya.text.vectorizer import SkipGramCountVectorizer\n", "\n", "stopwords = malaya.text.function.get_stopwords()\n", "vectorizer = SkipGramCountVectorizer(\n", "    max_df = 0.95,\n", "    min_df = 1,\n", "    ngram_range = (1, 3),\n", "    stop_words = stopwords,\n", "    skip = 2\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train SKLearn LDA model\n", "\n", "```python\n", "def fit(\n", "    corpus: List[str],\n", "    model,\n", "    vectorizer,\n", "    n_topics: int,\n", "    cleaning=simple_textcleaning,\n", "    stopwords=get_stopwords,\n", "    **kwargs,\n", "):\n", "    \"\"\"\n", "    Train a SKlearn model to do topic modelling based on corpus given.\n", "\n", "    Parameters\n", "    ----------\n", "    corpus: list\n", "    model : object\n", "        Should have `fit_transform` method. Commonly:\n", "\n", "        * ``sklearn.decomposition.TruncatedSVD`` - LSA algorithm.\n", "        * ``sklearn.decomposition.LatentDirichletAllocation`` - LDA algorithm.\n", "        * ``sklearn.decomposition.NMF`` - NMF algorithm.\n", "    vectorizer : object\n", "        Should have `fit_transform` method. 
Commonly:\n", "\n", " * ``sklearn.feature_extraction.text.TfidfVectorizer`` - TFIDF algorithm.\n", " * ``sklearn.feature_extraction.text.CountVectorizer`` - Bag-of-Word algorithm.\n", " * ``malaya.text.vectorizer.SkipGramCountVectorizer`` - Skip Gram Bag-of-Word algorithm.\n", " * ``malaya.text.vectorizer.SkipGramTfidfVectorizer`` - Skip Gram TFIDF algorithm.\n", " n_topics: int, (default=10)\n", " size of decomposition column.\n", " cleaning: function, (default=malaya.text.function.simple_textcleaning)\n", " function to clean the corpus.\n", " stopwords: List[str], (default=malaya.texts.function.get_stopwords)\n", " A callable that returned a List[str], or a List[str], or a Tuple[str]\n", "\n", " Returns\n", " -------\n", " result: malaya.topic_model.decomposition.Topic class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import LatentDirichletAllocation\n", "\n", "lda = malaya.topic_model.decomposition.fit(\n", " corpus,\n", " LatentDirichletAllocation,\n", " vectorizer = vectorizer,\n", " n_topics = 10,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Print topics\n", "\n", "```python\n", "def top_topics(\n", " self, len_topic: int, top_n: int = 10, return_df: bool = True\n", "):\n", " \"\"\"\n", " Print important topics based on decomposition.\n", "\n", " Parameters\n", " ----------\n", " len_topic: int\n", " size of topics.\n", " top_n: int, optional (default=10)\n", " top n of each topic.\n", " return_df: bool, optional (default=True)\n", " return as pandas.DataFrame, else JSON.\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
topic 0topic 1topic 2topic 3topic 4
0kerajaannegara selatanpartimalaysiabahasa
1hutangpembangunannya negara selatanpilihan rayamenteripertumbuhan
2negaranegararayarakyatnegara
3subjekprojekpilihanperdanamalaysia
4mdbmalaysiaperniagaan digital santannegarailmu
5menerimapengalamanberjalan lancarasliparti
6islamberkongsilancarkerajaanbersatu
7kemudahanberkongsi pengalamanperlembagaanperdana menterikerajaan
8menjualselatanberjalanrumahmdb
9takutnegara negara selatanumnorakyat malaysiapelbagai
\n", "
" ], "text/plain": [ " topic 0 topic 1 topic 2 \\\n", "0 kerajaan negara selatan parti \n", "1 hutang pembangunannya negara selatan pilihan raya \n", "2 negara negara raya \n", "3 subjek projek pilihan \n", "4 mdb malaysia perniagaan digital santan \n", "5 menerima pengalaman berjalan lancar \n", "6 islam berkongsi lancar \n", "7 kemudahan berkongsi pengalaman perlembagaan \n", "8 menjual selatan berjalan \n", "9 takut negara negara selatan umno \n", "\n", " topic 3 topic 4 \n", "0 malaysia bahasa \n", "1 menteri pertumbuhan \n", "2 rakyat negara \n", "3 perdana malaysia \n", "4 negara ilmu \n", "5 asli parti \n", "6 kerajaan bersatu \n", "7 perdana menteri kerajaan \n", "8 rumah mdb \n", "9 rakyat malaysia pelbagai " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lda.top_topics(5, top_n = 10, return_df = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Important sentences based on topics\n", "\n", "```python\n", "def get_sentences(self, len_sentence: int, k: int = 0):\n", " \"\"\"\n", " Return important sentences related to selected column based on decomposition.\n", "\n", " Parameters\n", " ----------\n", " len_sentence: int\n", " k: int, (default=0)\n", " index of decomposition matrix.\n", "\n", " Returns\n", " -------\n", " result: List[str]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['saya fikir berucaplah sekuat mana sekali pun tulislah sebanyak mana kertas cadangan tapi sekiranya rejim yang sama terus mentadbir negara perubahan tuntas yang kita mahukan mustahil akan terlaksana kerana political will itu tidak wujud',\n", " 'kami telah menerima permohonan untuk menjadikan flat perumahan sedia ada menjadi bilik darjah tingkatan enam dan kementerian telah meluluskan peruntukan rm juta bagi projek tersebut',\n", " 'kami menjual syarikat pajakan kami pagi ini dan wang yang diterima akan meningkatkan ringgit jadi ia akan mencerminkan dengan lebih baik lagi kekuatan sebenar dan daya tahan ekonomi malaysia',\n", " 'kerajaan dan rakan sekerja saya menteri pengangkutan datuk seri liow tiong lai akan memastikan anda berkembang dan berjaya melampaui imaginasi kami',\n", " 'kami menolak sebarang percubaan untuk merosak atau memusnahkan tanah suci dan akan menjaga tempat itu sebaik baiknya serta memberi tumpuan kepada kepentingan islam']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lda.get_sentences(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get topics\n", "\n", "```python\n", "def get_topics(self, len_topic: int):\n", " \"\"\"\n", " Return important topics based on decomposition.\n", "\n", " Parameters\n", " ----------\n", " len_topic: int\n", "\n", " Returns\n", " -------\n", " result: List[str]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " 'kerajaan hutang negara subjek mdb menerima islam kemudahan menjual takut'),\n", " (1,\n", " 'negara selatan pembangunannya negara selatan negara projek malaysia pengalaman berkongsi berkongsi pengalaman selatan negara negara selatan'),\n", " (2,\n", " 'parti pilihan raya raya pilihan perniagaan digital santan berjalan lancar lancar perlembagaan berjalan umno'),\n", " (3,\n", " 'malaysia menteri rakyat perdana negara asli kerajaan perdana menteri rumah rakyat malaysia'),\n", " (4,\n", " 'bahasa pertumbuhan negara malaysia ilmu parti 
bersatu kerajaan mdb pelbagai'),\n", " (5,\n", " 'parti rakyat keputusan ahli mengambil awam masyarakat pru pendapatan kerajaan'),\n", " (6,\n", " 'malaysia kerajaan rakyat masyarakat rakyat malaysia cina strategi kenyataan dasar hilang'),\n", " (7,\n", " 'undi kerajaan dakwaan dana kenyataan bidang kementerian syarikat sebarang doj'),\n", " (8,\n", " 'negara mdb kewangan syarikat negara bidang negara bidang perancangan negara maju negara maju perancangan membantu negara membantu negara bidang'),\n", " (9, 'malaysia harga ambil rm air usaha sabah anti dasar dijual')]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lda.get_topics(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }