NIRMOL (নির্মল)
Bangla Offensive Language Detection API and Dataset
Nirmol (নির্মল) is a microservice-based offensive language detection API. It detects offensive, bad, and slang words in Bangla (Bengali) and Banglish sentences. You can set up and host the API on any Node.js server.
Enter a sentence here to check whether it is offensive. Under the hood, this uses the Nirmol v1.0 API.
- Version: 1.0
- Author: Sakib Mahmud
- Created: 7 March, 2024
- Updated: 7 March, 2024
If you have any questions beyond the scope of this help file, please feel free to discuss them here: Nirmol Discussions.
Installation
You can download the dataset from the GitHub repository, or use the direct dataset link. The dataset can be used for ML and AI model training.
Nirmol API is based on:
- Node.js
- Express.js
npm packages used:
- body-parser
- cors
- fs
- nodemon
Run Nirmol locally
Step 1: Clone the Nirmol repository
git clone https://github.com/Sigmakib2/Nirmol.git
Step 2: Go to the Nirmol directory
cd Nirmol
Step 3: Install node modules
npm install
Step 4: Start the project
npm start
Then open your web browser and navigate to http://localhost:3000. You should see "Cannot GET /" on the page. To test the API, enter a sentence after the '/'. For example: "http://localhost:3000/hello world"
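The same GET request can also be made from code. Below is a minimal sketch, assuming Node.js 18+ (for the global fetch) and a server running on http://localhost:3000 as described above. The helper names buildGetUrl and checkSentence are hypothetical and not part of Nirmol.

```javascript
// Hypothetical helper: builds the GET URL for a sentence.
// encodeURIComponent escapes spaces and Bangla characters for use in a URL path.
function buildGetUrl(baseUrl, sentence) {
  return `${baseUrl}/${encodeURIComponent(sentence)}`;
}

// Hypothetical helper: sends the GET request and parses the JSON reply.
// Assumes a runtime with a global fetch (Node.js 18+ or a browser).
async function checkSentence(baseUrl, sentence) {
  const response = await fetch(buildGetUrl(baseUrl, sentence));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}

console.log(buildGetUrl("http://localhost:3000", "hello world"));
// http://localhost:3000/hello%20world
```

With the server running, checkSentence("http://localhost:3000", "hello world").then(console.log) would print the parsed response.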
You can deploy this project, or any other project, for free on Cyclic. If you sign up through this link, you get a free $10 credit.
Local Development
When you develop locally, especially if your frontend and backend run on different ports or domains, you will likely need to enable CORS (Cross-Origin Resource Sharing) so the frontend can communicate with the backend.
Express.js doesn't come with built-in CORS support, so you need a middleware package such as cors to enable it. You can install cors with npm; in this project it is already installed.
In the index.js file you will find a line like this:
const allowedOrigins = ['https://nirmol.pages.dev'];
This allows the origin 'https://nirmol.pages.dev': only requests originating from https://nirmol.pages.dev can access your server's resources. Replace 'https://nirmol.pages.dev' with the origin you want to allow. For example, if you are using HTML with the VS Code Live Server, replace it with 'http://127.0.0.1:5500'. Remember: the origin URL must not end with a '/'.
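As an illustration, an allow-list like this can be turned into an options object for the cors middleware, whose origin option accepts a function of the form (origin, callback). This is only a sketch: how index.js actually wires allowedOrigins is not shown here, and normalizeOrigin, normalizedOrigins, and corsOptions are hypothetical names.

```javascript
// Allow-list of origins. The second entry (the Live Server address) is written
// with a trailing slash on purpose, to show why it has to be removed.
const allowedOrigins = ["https://nirmol.pages.dev", "http://127.0.0.1:5500/"];

// Hypothetical helper: drops trailing slashes so entries match the browser's
// Origin header, which does not end with "/".
function normalizeOrigin(origin) {
  return origin.replace(/\/+$/, "");
}

const normalizedOrigins = allowedOrigins.map(normalizeOrigin);

// Options object in the shape accepted by the cors middleware.
const corsOptions = {
  origin(origin, callback) {
    // Requests without an Origin header (curl, server-to-server calls) are
    // allowed here; that is a choice made for this sketch.
    if (!origin || normalizedOrigins.includes(origin)) {
      callback(null, true);
    } else {
      callback(new Error("Not allowed by CORS"));
    }
  },
};

// In an Express app this would be registered as: app.use(cors(corsOptions));
```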
API Response
The API endpoint analyzes a sentence for offensive/slang words and provides additional information about the sentence.
For example, here is the response to a GET request:
{
"bad_sentence": true,
"bad_word_list": [
"কুত্তা"
],
"normal_words": [
"একটি",
"গালি",
"বা",
"খারাপ",
"শব্দ"
],
"badness": "16.67%"
}
You can also use the POST method to get a response. This feature was added by Tasnim Anas.
For a POST request, the endpoint is "http://localhost:3000/"
and you have to send the payload in the body like this:
{
"sentence": "Your sentence here..."
}
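A POST request with this payload might look like the sketch below. It assumes Node.js 18+ (global fetch) and that the server parses JSON bodies sent with a Content-Type of application/json; buildPostOptions and checkSentenceByPost are hypothetical names, not part of Nirmol.

```javascript
// Hypothetical helper: builds the fetch options for the POST endpoint.
// The body has the { "sentence": "..." } shape shown above.
function buildPostOptions(sentence) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" }, // assumed content type
    body: JSON.stringify({ sentence }),
  };
}

// Hypothetical helper: sends the POST request and parses the JSON reply.
async function checkSentenceByPost(baseUrl, sentence) {
  const response = await fetch(`${baseUrl}/`, buildPostOptions(sentence));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}

console.log(buildPostOptions("Your sentence here...").body);
// {"sentence":"Your sentence here..."}
```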
Here's what the response means:
- bad_sentence: Indicates whether the sentence contains any offensive/bad/slang words. This is always a boolean value.
- bad_word_list: Lists the offensive/bad/slang words found in the sentence.
- normal_words: Lists the words in the sentence that are not considered offensive/bad/slang.
- badness: Indicates the proportion of offensive/bad/slang words in the sentence, as a percentage.
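The sample response is consistent with badness being the number of bad words divided by the total number of words: 1 bad word out of 6 gives 16.67%. The sketch below reproduces that arithmetic. The formula is inferred from the sample only, the actual computation in index.js may differ, and computeBadness is a hypothetical name.

```javascript
// Hypothetical helper: percentage of bad words among all words,
// formatted with two decimals like the sample response.
function computeBadness(badWords, normalWords) {
  const total = badWords.length + normalWords.length;
  if (total === 0) {
    return "0.00%"; // avoid dividing by zero for an empty sentence
  }
  return `${((badWords.length / total) * 100).toFixed(2)}%`;
}

// One bad word and five normal words, as in the sample response.
console.log(computeBadness(["w1"], ["w2", "w3", "w4", "w5", "w6"]));
// 16.67%
```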
Features
The API ignores special symbols such as #, !, and @. Many people on the internet insert these symbols into slang words, and AI systems often fail to detect them. For example, "Hello World" can be written as "He#ll@ W@rl#d", which is difficult for many AI systems to detect. Nirmol takes a simple approach: when a word contains special symbols, the API ignores them and then checks the word.
This API also ignores emojis🥳
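One way to ignore symbols and emojis is to keep only letters, combining marks, and digits in each word. The sketch below does this with Unicode property escapes. It is an illustration of the idea, not the code used in index.js, and it does not cover every emoji sequence (some contain invisible combining characters). The names stripSymbols and cleanWords are hypothetical.

```javascript
// Hypothetical helper: keeps letters (\p{L}), combining marks (\p{M}) and
// digits (\p{N}); everything else, including symbols and most emojis, is removed.
// \p{M} matters for Bangla, whose vowel signs and hasanta are combining marks.
function stripSymbols(word) {
  return word.replace(/[^\p{L}\p{M}\p{N}]/gu, "");
}

// Hypothetical helper: splits a sentence on whitespace, cleans each word,
// and drops words that become empty (for example a lone emoji).
function cleanWords(sentence) {
  return sentence
    .split(/\s+/)
    .map(stripSymbols)
    .filter((word) => word.length > 0);
}

console.log(cleanWords("He#ll@ W@rl#d 🥳"));
// [ 'Hell', 'Wrld' ]
```

Each cleaned word could then be looked up in the word list.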
Some words in Bangla work as prefixes or suffixes and make other words toxic. You can include them in the prefixes_suffixes.json file. The API finds words in a sentence that contain any of these entries as a prefix or suffix and marks the whole word as a negative word.
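The prefix and suffix check can be sketched as follows. The structure of prefixes_suffixes.json is not described here, so the sketch assumes a flat array of strings and uses placeholder entries; hasFlaggedAffix is a hypothetical name.

```javascript
// Hypothetical helper: returns true when the word starts or ends with any
// entry of the affix list. Empty entries are skipped, because every string
// starts and ends with the empty string.
function hasFlaggedAffix(word, affixes) {
  return affixes.some(
    (affix) =>
      affix.length > 0 && (word.startsWith(affix) || word.endsWith(affix))
  );
}

// Placeholder entries standing in for the contents of prefixes_suffixes.json.
const exampleAffixes = ["xx", "zz"];

console.log(hasFlaggedAffix("xxword", exampleAffixes)); // true  (prefix)
console.log(hasFlaggedAffix("wordzz", exampleAffixes)); // true  (suffix)
console.log(hasFlaggedAffix("word", exampleAffixes));   // false
```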
Limitations
You cannot put a "/" symbol in the sentence when you use the GET method. For example, suppose you have a text area where someone writes "Hello world/earth" and you send the input value without any validation or sanitization. The request then fails with: "Cannot GET /hello%20world/earth". Use the POST method for such input.
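A client can work around this limitation by falling back to POST whenever the input contains a "/". The sketch below only plans the request and does not send it; planRequest is a hypothetical name, and the base URL is the local one used in this guide.

```javascript
// Hypothetical helper: picks GET for plain sentences and POST for sentences
// that contain "/", which the GET route cannot handle.
function planRequest(baseUrl, sentence) {
  if (sentence.includes("/")) {
    return {
      method: "POST",
      url: `${baseUrl}/`,
      body: JSON.stringify({ sentence }),
    };
  }
  return {
    method: "GET",
    url: `${baseUrl}/${encodeURIComponent(sentence)}`,
  };
}

console.log(planRequest("http://localhost:3000", "Hello world/earth").method);
// POST
console.log(planRequest("http://localhost:3000", "Hello world").url);
// http://localhost:3000/Hello%20world
```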
Update Words (Dataset)
Suppose you have your own list of offensive/bad/slang words and want to add them to your API. This repository provides a way to do that. After cloning the project you will see three relevant files: input.txt, nirmol.json, and txt-2-nirmol.js.
.gitignore
index.js
🟡-> input.txt
🟢-> nirmol.json
nirmol.png
package-lock.json
package.json
prefixes_suffixes.json
README.md
tree.txt
🔴 -> txt-2-nirmol.js
+---datasets
| Nirmol-v1-dataset.csv
|
\---node_modules
Here, input.txt contains all the offensive/bad/slang words available in the dataset, nirmol.json contains the same data in structured form, and txt-2-nirmol.js is the script that converts input.txt into nirmol.json.
Updating JSON List from Text File
- Edit the Text File:
- Locate the text file: input.txt
- Open the text file using a text editor of your choice.
- Update the Data in the Text File:
- Modify the content of the text file according to your requirements.
- Add, remove, or edit the lines in the text file as needed.
- Save Changes:
- Save the changes made to the text file.
- Run the Node.js Script:
- Ensure that you have the Node.js script available for updating the JSON list.
- Open your terminal or command prompt.
- Navigate to the directory containing the Node.js script.
- Execute the Node.js Script:
- Run the Node.js script by executing the following command:
node txt-2-nirmol.js
- Verify Output:
- Once the script execution is complete, verify that the JSON list has been updated correctly.
- Check the contents of the output JSON file (nirmol.json) to ensure that it reflects the changes made to the text file.
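For orientation, a text-to-JSON conversion of this kind can be sketched as below. This is not the content of txt-2-nirmol.js: it assumes one word per line in the text file and a flat JSON array as output, and the actual structure of nirmol.json may differ. The name linesToWordList is hypothetical.

```javascript
// Hypothetical helper: turns the text of a word file (one word per line,
// an assumption) into a list of unique, trimmed, non-empty words.
function linesToWordList(text) {
  const words = text
    .split(/\r?\n/) // handle both Unix and Windows line endings
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  return [...new Set(words)]; // drop duplicates, keep first-seen order
}

// Example with placeholder words instead of reading input.txt.
const exampleText = "word1\nword2\n\nword1\n";
console.log(JSON.stringify(linesToWordList(exampleText), null, 2));

// With real files, the text could be read and written with the fs module:
//   const fs = require("fs");
//   const text = fs.readFileSync("input.txt", "utf8");
//   fs.writeFileSync("output.json", JSON.stringify(linesToWordList(text), null, 2));
```

The file name output.json in the comment is a placeholder, chosen so that the sketch cannot overwrite nirmol.json by accident.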
Use Cases
Here are some use cases of this API
- Content moderation: Bangla websites often host user-generated content such as comments, forum posts, or user profiles. This API can be integrated into these platforms to automatically detect and filter out inappropriate language, thus maintaining a clean and respectful environment for users.
- Social media platforms: Social media platforms that support Bangla language content can use this API to automatically flag or filter out offensive or inappropriate content in user posts, comments, and messages, helping to maintain a positive and safe community for users.
- E-commerce platforms: E-commerce websites serving the Bangla-speaking community can utilize this API to ensure that product reviews and comments remain free from offensive language, ensuring a positive shopping experience for customers.
- Educational platforms: Educational websites and software applications targeting Bangla-speaking users can use this API to monitor and filter user-generated content in discussion forums, chatrooms, or collaborative projects, promoting a respectful and constructive learning environment.
- Parental control software: Parental control software can leverage this API to monitor and filter out inappropriate content in Bangla language websites and applications, helping parents protect their children from exposure to harmful or offensive material online.
- Chat applications: Bangla language chat applications can integrate this API to automatically detect and filter out offensive language in user messages, helping to maintain a friendly and respectful communication environment among users.
- Customer support platforms: Customer support platforms serving Bangla-speaking customers can use this API to monitor and filter out abusive or inappropriate language in customer inquiries and support tickets, ensuring a professional and respectful interaction with users.
Source & Credits
Dataset
- Bengali-Hate-Speech-Dataset - https://github.com/rezacsedu/Bengali-Hate-Speech-Dataset
- BNLexicon - https://github.com/sazzadcsedu/BNLexicon
- BAAD: A Multipurpose Dataset for Automatic Bangla Offensive Speech Recognition - https://data.mendeley.com/datasets/w24g8xn23c/3
Technology
- Node.js - https://nodejs.org/
- Express.js - https://expressjs.com/