NIRMOL (নির্মল)

Bangla Offensive Language Detection API and Dataset

Nirmol (নির্মল) is a microservice-based offensive language detection API. It detects offensive, bad, or slang words in Bangla/Bengali/Banglish sentences. You can set up and host the API yourself on any Node.js server.


Enter a sentence here to check whether it is offensive. Under the hood, this uses the Nirmol v1.0 API.


  • Created: 7 March, 2024
  • Updated: 7 March, 2024

If you have any questions beyond the scope of this help file, please feel free to ask them in Nirmol Discussions.

Bangla Bad word detection API


Installation

You can download the dataset from the GitHub repository or through the direct dataset link. The dataset can be used for training ML and AI models.

Nirmol API is based on:

  1. Node.js
  2. Express.js

npm packages used

  1. body-parser
  2. cors
  3. fs (built into Node.js, so it needs no separate installation)
  4. nodemon

Run Nirmol locally

Step 1: Clone the Nirmol repository

git clone https://github.com/Sigmakib2/Nirmol.git

Step 2: Go to the Nirmol directory

cd Nirmol

Step 3: Install node modules

npm install

Step 4: Start the project

npm start

Then open your web browser and navigate to http://localhost:3000. You should see "Cannot GET /" on the page, because the API expects a sentence after the "/". To test the API, append some text to the URL, for example "http://localhost:3000/hello world".
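The same request can be made from JavaScript. This is a minimal sketch, assuming the server runs at http://localhost:3000 as above and that the runtime provides a global fetch (Node.js 18+ or a browser). The helper names buildNirmolUrl and checkSentence are illustrative and are not part of the project.

```javascript
// Build the GET URL for a sentence. encodeURIComponent escapes spaces
// and other reserved characters (a space becomes %20).
function buildNirmolUrl(baseUrl, sentence) {
  const trimmedBase = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return `${trimmedBase}/${encodeURIComponent(sentence)}`;
}

// Send the request and return the parsed JSON response.
async function checkSentence(baseUrl, sentence) {
  const response = await fetch(buildNirmolUrl(baseUrl, sentence));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}

// Usage (requires a running server):
// checkSentence("http://localhost:3000", "hello world").then(console.log);
```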

You can deploy this project, or any other project, for free on Cyclic. If you sign up through the referral link, you get $10 of free credit.


Local Development

When you're developing locally, especially if your frontend and backend run on different ports or domains, you'll likely need to enable CORS (Cross-Origin Resource Sharing) so that the frontend can communicate with the backend.

Express.js doesn't come with built-in CORS support, so you need a middleware package such as cors to enable it. You can install cors using npm. In this project, it is already installed.

In the index.js file you will find a line like this:

const allowedOrigins = ['https://nirmol.pages.dev'];

This allows the origin 'https://nirmol.pages.dev', so only requests originating from https://nirmol.pages.dev can access your server's resources. Replace 'https://nirmol.pages.dev' with the actual origin you want to allow. For example, if you are using HTML with the VS Code Live Server, replace it with 'http://127.0.0.1:5500'. Remember: the origin URL must not have a '/' at the end.
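The following sketch shows one common way to connect such an allow-list to the cors middleware. It is an assumption about how the check could be written, not a copy of the project's index.js. The function form of the `origin` option is part of the cors package, and letting requests without an Origin header through is a design choice made here for illustration.

```javascript
// Entries must match the browser's Origin header exactly, so no trailing "/".
const allowedOrigins = ["https://nirmol.pages.dev", "http://127.0.0.1:5500"];

const corsOptions = {
  // cors calls this with the request's Origin header. The header is
  // undefined for non-browser clients such as curl.
  origin(origin, callback) {
    if (!origin || allowedOrigins.includes(origin)) {
      callback(null, true); // allow the request
    } else {
      callback(new Error(`Origin ${origin} is not allowed by CORS`));
    }
  },
};

// In index.js this would be registered roughly as:
//   const cors = require("cors");
//   app.use(cors(corsOptions));
```

Because the comparison is an exact string match, 'http://127.0.0.1:5500/' (with a trailing slash) would never match the header a browser sends, which is why the allow-list entries must omit it.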


API Response

The API endpoint analyzes a sentence for offensive/slang words and provides additional information about the sentence.

For example, here is a GET request and its response:


{
  "bad_sentence": true,
  "bad_word_list": [
    "কুত্তা"
  ],
  "normal_words": [
    "একটি",
    "গালি",
    "বা",
    "খারাপ",
    "শব্দ"
  ],
  "badness": "16.67%"
}

You can also use the POST method to get a response. This feature was added by Tasnim Anas.

For a POST request, the endpoint is "http://localhost:3000/", and you send the payload in the request body like this:

{
  "sentence": "Your sentence here..."
}
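A minimal sketch of sending this payload with fetch. It assumes the server parses JSON request bodies (body-parser is among the listed packages) and that the runtime has a global fetch. The names buildPostOptions and checkSentenceByPost are illustrative only.

```javascript
// Build the fetch options for a POST request carrying the sentence as JSON.
function buildPostOptions(sentence) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ sentence }),
  };
}

// Send the sentence to the endpoint and return the parsed JSON response.
async function checkSentenceByPost(sentence, endpoint = "http://localhost:3000/") {
  const response = await fetch(endpoint, buildPostOptions(sentence));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}

// Usage (requires a running server):
// checkSentenceByPost("Your sentence here...").then(console.log);
```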


Here's what the response means:

  1. bad_sentence: Indicates whether the sentence contains any offensive/bad/slang words. The value is always a boolean (true or false).
  2. bad_word_list: Lists the offensive/bad/slang words found in the sentence.
  3. normal_words: Lists the words in the sentence that are not considered offensive/bad/slang.
  4. badness: Indicates the proportion of offensive/bad/slang words in the sentence, as a percentage string.
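The sample response above is consistent with badness being the share of bad words among all words in the sentence: 1 bad word out of 6 gives 16.67%. The sketch below assumes that formula, which is inferred from the example rather than taken from the source code. The function name and the "0.00%" result for an empty sentence are hypothetical choices.

```javascript
// Compute the badness percentage from the two word lists in the response.
function computeBadness(badWords, normalWords) {
  const total = badWords.length + normalWords.length;
  if (total === 0) return "0.00%"; // assumed behaviour for empty input
  return `${((badWords.length / total) * 100).toFixed(2)}%`;
}

console.log(
  computeBadness(["কুত্তা"], ["একটি", "গালি", "বা", "খারাপ", "শব্দ"])
); // 16.67%
```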

Features

The API ignores special symbols such as #, !, and @. Many people on the internet insert these symbols into slang words, and AI systems often fail to detect the result. For example, "Hello World" can be written as "He#ll@ W@rl#d", which is difficult for many AI systems to recognize. Nirmol takes a simple approach: when a word contains special symbols, the API strips them and then checks the remaining word.


This API also ignores emojis🥳
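One way to strip symbols and emojis is a Unicode-aware regular expression. The sketch below shows that general technique and is not necessarily how Nirmol implements it. The function names stripSymbols and cleanWords are hypothetical.

```javascript
// Keep only letters (\p{L}), combining marks (\p{M}, needed for Bangla
// vowel signs and the hasanta) and digits (\p{N}). Everything else,
// including punctuation and emoji, is removed.
function stripSymbols(word) {
  return word.replace(/[^\p{L}\p{M}\p{N}]/gu, "");
}

// Split a sentence on whitespace, clean each word, and drop words that
// become empty (for example a standalone emoji).
function cleanWords(sentence) {
  return sentence
    .split(/\s+/)
    .map(stripSymbols)
    .filter((word) => word.length > 0);
}

console.log(cleanWords("He#ll@ W@rl#d 🥳")); // [ 'Hell', 'Wrld' ]
```

Note that the symbols are removed, not replaced with the letters they stand in for, so "He#ll@" becomes "Hell". Some emoji also carry a variation selector, which is itself a combining mark and would survive this filter, so a fuller implementation would need to handle that case.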


Some words in Bangla work as prefixes or suffixes and make other words toxic. You can list such words in the prefixes_suffixes.json file. The API looks for any word in the sentence that carries one of them as a prefix or suffix and treats that whole word as offensive.
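The prefix and suffix check can be pictured as follows. The structure of prefixes_suffixes.json is not described here, so this sketch assumes a flat array of strings and uses a placeholder affix; both are assumptions, as is the function name hasToxicAffix.

```javascript
// Placeholder affix list. The real entries live in prefixes_suffixes.json.
const exampleAffixes = ["xyz"];

// A word is flagged when it starts or ends with any listed affix.
function hasToxicAffix(word, affixList) {
  return affixList.some(
    (affix) => word.startsWith(affix) || word.endsWith(affix)
  );
}

console.log(hasToxicAffix("xyzword", exampleAffixes)); // true
console.log(hasToxicAffix("woxyzrd", exampleAffixes)); // false
```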

Limitations

With the GET method, the sentence cannot contain a "/" symbol. Suppose you have a text area where someone writes "Hello world/earth", and you pass the input value to the API without any validation or sanitization. The request then fails with an error like "Cannot GET /hello%20world/earth", because the "/" is treated as a path separator. Use the POST method for such input.
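A client can work around this limitation by switching to POST whenever the sentence contains a "/". The sketch below assumes the base URL http://localhost:3000 and a JSON body with a sentence field, as described for POST requests above. The helper name buildRequest is hypothetical.

```javascript
// Choose GET for plain sentences and POST for sentences containing "/".
function buildRequest(sentence, baseUrl = "http://localhost:3000") {
  if (sentence.includes("/")) {
    return {
      url: `${baseUrl}/`,
      options: {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ sentence }),
      },
    };
  }
  return {
    url: `${baseUrl}/${encodeURIComponent(sentence)}`,
    options: { method: "GET" },
  };
}

// Usage (requires a running server and a global fetch):
// const { url, options } = buildRequest("Hello world/earth");
// fetch(url, options).then((res) => res.json()).then(console.log);
```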


Update Words (Dataset)

Suppose you have your own list of offensive/bad/slang words and want to add them to your API. This repository includes a way to do that. After cloning the project, you will see three files: input.txt, nirmol.json, and txt-2-nirmol.js.


              .gitignore
              index.js
        🟡->  input.txt
        🟢->  nirmol.json
              nirmol.png
              package-lock.json
              package.json
              prefixes_suffixes.json
              README.md
              tree.txt
        🔴 -> txt-2-nirmol.js
              +---datasets
              |       Nirmol-v1-dataset.csv
              |       
              \---node_modules
            

Here, input.txt contains all the offensive/bad/slang words available in the dataset, nirmol.json contains the same data in structured form, and txt-2-nirmol.js is the script that converts input.txt into nirmol.json.

Updating JSON List from Text File

  1. Edit the Text File:
    • Locate the text file: input.txt
    • Open the text file using a text editor of your choice.
  2. Update the Data in the Text File:
    • Modify the content of the text file according to your requirements.
    • Add, remove, or edit the lines in the text file as needed.
  3. Save Changes:
    • Save the changes made to the text file.
  4. Run the Node.js Script:
    • Ensure that you have the Node.js script available for updating the JSON list.
    • Open your terminal or command prompt.
    • Navigate to the directory containing the Node.js script.
  5. Execute the Node.js Script:
    • Run the Node.js script by executing the following command:

      node txt-2-nirmol.js
  6. Verify Output:
    • Once the script execution is complete, verify that the JSON list has been updated correctly.
    • Check the contents of the output JSON file ( nirmol.json ) to ensure that it reflects the changes made to the text file.
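To give an idea of what such a conversion involves, here is a sketch of a text-to-JSON step. It is not the actual txt-2-nirmol.js script: it assumes input.txt holds one word per line and that the output is a flat JSON array, and neither assumption is confirmed by this document. The function names are hypothetical.

```javascript
// Turn the text of a word list (one word per line) into a JSON string.
function wordsTextToJson(text) {
  const words = text
    .split(/\r?\n/) // handle both Unix and Windows line endings
    .map((line) => line.trim())
    .filter((line) => line.length > 0); // skip blank lines
  // A Set removes duplicate entries while keeping first-seen order.
  return JSON.stringify([...new Set(words)], null, 2);
}

// Read the input file, convert it, and write the JSON file.
function convertFile(inputPath, outputPath) {
  const fs = require("node:fs");
  const text = fs.readFileSync(inputPath, "utf8");
  fs.writeFileSync(outputPath, wordsTextToJson(text), "utf8");
}

// Usage: convertFile("input.txt", "nirmol.json");
```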

Use Cases

Here are some use cases for this API:

  1. Content moderation: Bangla websites often host user-generated content such as comments, forum posts, or user profiles. This API can be integrated into these platforms to automatically detect and filter out inappropriate language, thus maintaining a clean and respectful environment for users.
  2. Social media platforms: Social media platforms that support Bangla language content can use this API to automatically flag or filter out offensive or inappropriate content in user posts, comments, and messages, helping to maintain a positive and safe community for users.
  3. E-commerce platforms: E-commerce websites serving the Bangla-speaking community can utilize this API to ensure that product reviews and comments remain free from offensive language, ensuring a positive shopping experience for customers.
  4. Educational platforms: Educational websites and software applications targeting Bangla-speaking users can use this API to monitor and filter user-generated content in discussion forums, chatrooms, or collaborative projects, promoting a respectful and constructive learning environment.
  5. Parental control software: Parental control software can leverage this API to monitor and filter out inappropriate content in Bangla language websites and applications, helping parents protect their children from exposure to harmful or offensive material online.
  6. Chat applications: Bangla language chat applications can integrate this API to automatically detect and filter out offensive language in user messages, helping to maintain a friendly and respectful communication environment among users.
  7. Customer support platforms: Customer support platforms serving Bangla-speaking customers can use this API to monitor and filter out abusive or inappropriate language in customer inquiries and support tickets, ensuring a professional and respectful interaction with users.
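For moderation-style use cases like these, an application typically turns the API response into a decision. The sketch below holds a message for review when it contains bad words and its badness exceeds a threshold. The threshold, the policy, and the name shouldHoldForReview are illustrative assumptions, not part of Nirmol.

```javascript
// Decide whether a message should be held, based on a Nirmol response.
function shouldHoldForReview(apiResponse, maxBadnessPercent = 0) {
  // parseFloat reads the leading number and ignores the trailing "%".
  const badness = parseFloat(apiResponse.badness);
  return apiResponse.bad_sentence === true && badness > maxBadnessPercent;
}

const sampleResponse = {
  bad_sentence: true,
  bad_word_list: ["কুত্তা"],
  normal_words: ["একটি", "গালি", "বা", "খারাপ", "শব্দ"],
  badness: "16.67%",
};

console.log(shouldHoldForReview(sampleResponse, 10)); // true
console.log(shouldHoldForReview(sampleResponse, 50)); // false
```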

FAQ


This API analyzes input sentences to detect the presence of any potentially inappropriate or "bad" words.
The API accepts URL-encoded sentences as input. With the GET method, you pass the sentence directly in the URL path, after the "/".
The API response is in JSON format and includes whether the sentence contains bad words, a list of the detected bad words, the normal words in the sentence, and the badness percentage.
For now, Nirmol does not provide a public API, so you have to host your own. You can use the Cyclic platform to deploy your project, and signing up through the referral link gives you $10 of free credit.
You can integrate this API into your website, mobile app, or any software application that requires bad word detection. Please share your work with us so that we can share it with people around the world!
Nirmol is a free, open-source project. We do not charge for the dataset or the source code. The only support we ask for is your feedback: share your work, and update and improve the existing code. That is what we love!

Source & Credits

Dataset

Technology

Documentation Template