Using NLP and LSTM to combat cyberbullying
What is Cyberbullying?
Cyberbullying is any type of bullying or harassment carried out through an electronic medium.
Why is the cyberbullying trend worrisome?
· A recent study by Child Rights and You, a non-governmental organisation, found that around 9.2% of 630 children surveyed in the Delhi-National Capital Region reported having experienced cyberbullying. [scroll.in]
· Data released by the National Crime Records Bureau showed that cases of cyberstalking or bullying of women or children increased by 36%, from 542 in 2017 to 739 in 2018. [scroll.in]
· 25 percent of students who are cyberbullied turn to self-harm to cope. [pandasecurity.com]
· A separate study found that young adults who experience cyberbullying are twice as likely to self-harm and exhibit suicidal behaviour. [pandasecurity.com]
How can AI techniques help combat cyberbullying?
· NLP techniques can be used to determine the tone of a message and detect specific sentiments such as bullying and hate speech.
· NLP algorithms have an advantage over parental control software and keyword-spotting blockers in that they can be trained to recognize subtle and sarcastic comments.
· Classical machine learning for NLP usually requires more time for training and hand-crafting features; deep learning techniques can be used instead to improve accuracy.
· These techniques are also useful because slurs and insults are often misspelled, intentionally or not, and such variants are detected better by deep learning techniques than by classical machine learning algorithms.
A case study to understand deep learning better:
Let us explore a sample case study to understand deep learning for NLP better. For this we use a Kaggle Twitter hate speech dataset.
This dataset contains around 30K training tweets labelled 1 or 0, where 1 corresponds to hate speech.
The data distribution shows that only about 7% of the tweets are labelled as hate speech, so the classes are heavily imbalanced. Any of the balancing techniques can be used to address this; here we’ll use simple random oversampling.
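To make the idea of random oversampling concrete, here is a minimal sketch in plain Python. It is not the notebook's code: the function name `random_oversample` and the toy tweets are hypothetical, and the approach shown (duplicating randomly chosen minority rows until both classes are the same size) is just the simplest form of the technique.

```python
import random
from collections import Counter

def random_oversample(texts, labels, seed=42):
    """Duplicate randomly chosen minority-class rows until classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    out_texts, out_labels = list(texts), list(labels)
    for label, count in counts.items():
        # Indices of the original rows that carry this label.
        pool = [i for i, y in enumerate(labels) if y == label]
        # Draw (with replacement) as many extra rows as this class is short.
        for i in rng.choices(pool, k=target - count):
            out_texts.append(texts[i])
            out_labels.append(label)
    return out_texts, out_labels

# Toy data (hypothetical): 1 hateful tweet out of 5.
texts = ["nice day", "love this", "great game", "so happy", "hateful text"]
labels = [0, 0, 0, 0, 1]
bal_texts, bal_labels = random_oversample(texts, labels)
print(Counter(bal_labels))
```

In practice, oversampling is usually applied only to the training portion of the data, so that duplicated tweets do not end up in both the training and the test set.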
Before proceeding, we must clean the data and prepare it for the model.
Our text pre-processing will include the following steps:
1. removing special characters
2. converting all letters to lower case
3. removing stop words
4. lemmatization (reducing each word to its dictionary form; lemmatization looks at the surrounding text to determine a given word’s part of speech)
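The four steps above can be sketched as a single function. This is an illustrative version, not the notebook's code: `clean_tweet` and the tiny `STOP_WORDS` set are assumptions, and the lemmatizer is passed in as an optional callable so the sketch runs without extra downloads.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real run would use a
# fuller list, for example the English stop words shipped with NLTK.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "i"}

def clean_tweet(text, stop_words=STOP_WORDS, lemmatize=None):
    """Apply the four pre-processing steps to one tweet and return a string."""
    # 1. Remove special characters: keep only letters and whitespace.
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # 2. Convert all letters to lower case, then split on whitespace.
    tokens = text.lower().split()
    # 3. Remove stop words.
    tokens = [t for t in tokens if t not in stop_words]
    # 4. Lemmatize each remaining token, if a lemmatizer was supplied.
    if lemmatize is not None:
        tokens = [lemmatize(t) for t in tokens]
    return " ".join(tokens)

print(clean_tweet("@user The CATS are running!!! #sad"))
```

A real lemmatizer, such as the `lemmatize` method of NLTK's `WordNetLemmatizer`, can be passed as the `lemmatize` argument once its data has been downloaded.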
Let’s now make a train-test split, holding out 20% of the data for testing.
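A minimal sketch of an 80/20 split using only the standard library is shown here. The helper name `split_train_test` and the fixed seed are assumptions made for illustration.

```python
import random

def split_train_test(texts, labels, test_size=0.2, seed=42):
    """Shuffle the rows and hold out a `test_size` fraction for testing."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [texts[i] for i in train_idx]
    x_test = [texts[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

# Toy data (hypothetical): ten tweets, two of them labelled as hate speech.
texts = [f"tweet {i}" for i in range(10)]
labels = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
x_train, x_test, y_train, y_test = split_train_test(texts, labels)
print(len(x_train), len(x_test))
```

scikit-learn's `train_test_split` does the same job and, with its `stratify` argument, can also keep the share of hate-speech tweets equal in both parts, which matters for imbalanced data like this.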
Let’s set the hyperparameters:
Tokenization is a method used to break raw text into smaller units called tokens (words, sentences, characters, or subwords). These tokens help the model capture context and are the basis for building an NLP model.
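To show what word-level tokenization produces, here is a small standard-library sketch that builds a vocabulary and turns each tweet into a fixed-length list of integer IDs. The names `build_vocab` and `encode`, and the values of `max_words` and `max_len`, are hypothetical hyperparameters chosen for the example; the notebook's actual settings are not stated here.

```python
from collections import Counter

def build_vocab(texts, max_words=5000):
    """Map the most frequent words to integer IDs; 0 is padding, 1 is unknown."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i + 2 for i, (word, _) in enumerate(counts.most_common(max_words))}

def encode(text, vocab, max_len=10):
    """Turn one text into a fixed-length list of token IDs."""
    ids = [vocab.get(word, 1) for word in text.split()][:max_len]
    return ids + [0] * (max_len - len(ids))  # pad on the right with zeros

texts = ["user cats running sad", "cats love user"]
vocab = build_vocab(texts)
encoded = encode("cats hate user", vocab, max_len=5)
print(encoded)
```

The word "hate" never appeared in the texts used to build the vocabulary, so it is mapped to the unknown ID 1, and the last two positions are zero padding. Sequences of equal length like this are what the embedding layer of the model expects as input.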
LSTMs are a special kind of RNN capable of learning long-term dependencies. An LSTM can use a sequence of multiple words to identify which class a text belongs to, which is quite helpful in NLP.
We’ll build a sequential model and add various layers to it:
- The first layer is the Embedding layer, which uses vectors of length 100 to represent each word. Word embeddings provide a dense representation of words and their relative meanings.
- The second layer is an LSTM layer with 16 neurons.
- The third layer is an LSTM layer with 6 neurons.
- The dense layer is the output layer, which has 2 cells representing the 2 categories in this case. The activation function is sigmoid for binary classification.
- Finally, we’ll use the adam optimizer and binary_crossentropy. The Adam optimizer handles sparse gradients and noisy problems well. binary_crossentropy is used as the loss function since this problem has binary outputs.
(We should experiment with different layers and hyperparameters when training the model to get the best result.)
A sample model with the above layers is shown below:
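The layer stack described above could be written in Keras roughly as follows. This is a minimal sketch rather than the exact notebook code: `VOCAB_SIZE` and `MAX_LEN` are assumed values, and the random arrays only stand in for tokenized tweets so that the snippet runs end to end.

```python
import numpy as np
from tensorflow import keras

VOCAB_SIZE = 5000  # assumed vocabulary size (not stated in the article)
MAX_LEN = 50       # assumed padded tweet length (not stated in the article)

model = keras.Sequential([
    keras.Input(shape=(MAX_LEN,)),
    # One 100-dimensional dense vector per token ID.
    keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=100),
    # The first LSTM returns the full sequence so it can feed a second LSTM.
    keras.layers.LSTM(16, return_sequences=True),
    keras.layers.LSTM(6),
    # Two sigmoid outputs, one per class, so labels must be one-hot encoded.
    keras.layers.Dense(2, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Smoke test on random token IDs (placeholder data, not real tweets).
x = np.random.randint(0, VOCAB_SIZE, size=(8, MAX_LEN))
y = keras.utils.to_categorical(np.random.randint(0, 2, size=8), num_classes=2)
model.fit(x, y, epochs=1, batch_size=4, verbose=0)
preds = model.predict(x, verbose=0)
print(preds.shape)
```

With a 2-cell output layer, each label has to be one-hot encoded, as done here with `to_categorical`. A single sigmoid cell trained on the raw 0/1 labels is the more common setup for binary classification and would be a reasonable variation to try.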
Training results with different numbers of epochs
As shown, in this simple case study we are able to achieve a decent accuracy of ~95% with the LSTM. With more hyperparameter tuning or a different neural network model, we may get better results.
Deep learning techniques are finding increased use in the field of NLP due to their versatility.
Current work in industry:
Many platforms are using NLP and deep learning techniques to combat cyberbullying:
· In June 2016, Facebook introduced DeepText as “a deep learning-based text understanding engine that can understand with near-human accuracy the textual content of several thousand posts per second”.
· Twitter also uses AI technology to spot spam and recognize negative interactions.
For the complete code, you can refer to my Kaggle notebook here
You can also refer to my article on indiaAI here