COJ Robotics & Artificial Intelligence

Spell Checker

Sarthak Keshari1, Akarsh Reddy T1, Rudrangshu Ghosh1, Gayathri R1 and Neelanarayanan V2*

1Vellore Institute of Technology, Chennai, India

2Department of Cyber Physical Systems, VIT Chennai, India

*Corresponding author: Neelanarayanan V, Department of Cyber Physical Systems, VIT Chennai, India

Submission: April 08, 2022; Published: April 26, 2022

DOI: 10.31031/COJRA.2022.02.000526

ISSN: 2832-4463
Volume 2 Issue 1

Abstract

Background: In today’s world of rapid digital communication, the importance of proper spelling and language has taken a back seat. A spell checker application addresses this issue by giving people an easy, readily accessible way to improve their spelling in digital and written communication.

Objective: To add spell-checking and correction capabilities to any digital device by using Natural Language Processing techniques. The spell checker helps users reduce their typing load by identifying spelling errors, and makes it easy to check anything from single words to complete paragraphs. Its main goal is to provide a unified treatment of various spelling corrections. First, the spell-checking and correcting problems are formally described to provide a better understanding of these tasks. The spell checker and corrector form a stand-alone application capable of processing a string of words or text. It is designed with Natural Language Processing principles in mind, and search-and-replace algorithms are adapted to fit the spell-checking domain. Spell checking identifies both the words that are valid in the language and the misspelled words; when a misspelled word is identified, it suggests alternative word(s) as the correct spelling.

Keywords: Spell; Tokenize; Edit-Distance; Levenshtein distance

Introduction

In today’s world of fast-growing digitalization, every facet of transformation involves textual and graphical data, making it essential that key points are communicated with clear and correct usage and spelling of words. Let us consider how spell checking plays an important role in writing mails, letters, recommendations, etc. with a reliable use of words, because a single misspelled word can definitely change what a reader takes away from an article or report.

A. Communication: One of the main reasons good spelling is important is that it facilitates communication. The standardization of spelling makes texts much easier to understand.

B. Avoid confusion: Good and correct spelling helps to avoid confusion. It can be difficult to understand what someone means if the words are not spelled correctly.

C. Future prospects: It is tempting to think that spelling isn’t important anymore, but poor spelling could seriously affect your prospects and career. This article describes the spell-checking algorithm and also highlights how proper pre-processing steps help us arrive at a correct result. The article also covers an implementation with a proper user interface, making the technique more reliable and easier to use. We will discuss pre-processing techniques such as lower casing, tokenization, punctuation mark removal, stemming and lemmatization, and then use the filtered data to apply the spell-check algorithm.

Text pre-processing

In this report, we have implemented six different types of text pre-processing techniques:

A. Lower Casing

B. Tokenization

C. Punctuation Mark Removal

D. Stop Word Removal

E. Stemming

F. Lemmatization

Lower casing

As the name suggests, we convert our text data into lower case. This is needed because, when processing text in sentences, the program will encounter both uppercase and lowercase forms such as ‘The’ and ‘the’, which are different to a computer even though they mean the same thing. To resolve this, we convert all words to lowercase to provide uniformity in the text. For example:

A = "Change the Sentence to Lower Case."

A = A.lower()

Therefore, A = "change the sentence to lower case."

Tokenization

The next text pre-processing step is tokenization. With tokenization, the paragraph is broken down into smaller units, such as sentences or words, and each unit is considered a single token. The main principle of tokenization is to try to understand the meaning of the text by parsing the smaller units, or tokens, that make up the paragraph. Let us start by tokenizing paragraphs into sentences:

Sentence tokenizer

Here we take a paragraph as input and tokenize it into its constituent sentences. The result is a list of sentences stored in a variable named ‘sentences’, containing each sentence of the paragraph. The length of the list tells us the total number of sentences. For example:

paragraph = "I like to read books. I am fond of mystery and fiction. I hate if someone interrupts my reading routine."

sentences = nltk.sent_tokenize(paragraph.lower())

print (sentences)

OUTPUT:

A. I like to read books.

B. I am fond of mystery and fiction.

C. I hate if someone interrupts my reading routine.

Word tokenizer

Similarly, we can also tokenize the paragraph into words. The result is a list called “words” that contains each word in the paragraph. The length of the list gives us the total number of words present in the paragraph.

For example:

words = nltk.word_tokenize(paragraph.lower())

print (words)

Output:

A. I

B. like

C. to

D. read

...

Punctuation-word tokenizer

Proceeding to the next step, we need to remove the punctuation marks from our word list.

For example:

words = ['i', 'like', 'to', 'read', 'books', '.']

words = [w for w in words if w.isalnum()]  # keep only alphanumeric tokens

print(words)

OUTPUT:

['i', 'like', 'to', 'read', 'books']

Stemming

Stemming is a natural language processing technique that reduces inflection in words to their root forms, which aids in the preprocessing of text, words, and documents for text normalization.

For example: The words “changing”, “changes”, “change” are all reduced to “chang”
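
The article does not say which stemmer is used; as a minimal sketch, NLTK’s PorterStemmer (an assumption on our part) reproduces the example above:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["changing", "changes", "change"]:
    print(word, "->", stemmer.stem(word))  # all three print "chang"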

Lemmatization

In NLP, lemmatization takes context into account and converts a word into a meaningful base form. The converted words are called lemmas. We just saw how we can stem words down to their roots; however, stemming a word does not guarantee that the resulting word is part of the vocabulary of the language, and it often produces meaningless words. To overcome this disadvantage, we use lemmatization. For example, ‘chang’ from the stemmed version above is converted back to ‘change’.
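
A minimal sketch, assuming NLTK’s WordNetLemmatizer (the article does not name the lemmatizer used):

import nltk
nltk.download("wordnet", quiet=True)  # one-time download of the WordNet data
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("changing", pos="v"))  # change
print(lemmatizer.lemmatize("changes", pos="v"))   # change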

Methodology/Procedure/Principal Findings

Identify misspelled word

Let us consider an example: how would we know whether the word “walkng” is spelled correctly or incorrectly? If a word is spelled correctly, it will be found in a dictionary; if it is not there, it is probably a misspelled word. Hence, whenever a word is not found in the dictionary, we flag it for correction.
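
A minimal sketch of this dictionary lookup, assuming NLTK’s ‘words’ corpus as the vocabulary (the article does not specify which dictionary is used):

import nltk
nltk.download("words", quiet=True)  # one-time download of the word list
from nltk.corpus import words

vocabulary = set(w.lower() for w in words.words())

def is_misspelled(word):
    # A word not found in the vocabulary is flagged for correction.
    return word.lower() not in vocabulary

print(is_misspelled("walking"))  # False
print(is_misspelled("walkng"))   # True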

Finding edit distance (levenshtein distance)

Spelling and typing errors are common in human-made documentation. The problem of detecting and automatically correcting errors in words is a major research challenge. Word errors can be divided into two types, non-word errors and real-word errors, caused by missing letters, extra letters, wrong letters or jumbled-up letters. We use the minimum edit distance. An edit is an operation performed on a string to transform it into another string, and the edit distance n (1, 2, 3, etc.) counts the number of edit operations that have to be performed. Therefore, the edit distance n tells us how many operations away one string is from the other. Following are the different types of edits:

A. Insert (will add a letter)

B. Delete (will remove a letter)

C. Switch (will swap two adjacent letters)

D. Replace (replace one character with another)

With these four edits, we will be able to modify any string.

Therefore, combining edits allows us to list all possible strings that are n edits away (Figure 1). In the example above, which is a non-word error, we have to perform three edit operations: a substitution, another substitution and an insertion. If each edit has a standard cost of 1, the edit distance is 3. Similarly, in Figure 2, the only limitation of this edit distance is that it can only handle non-word errors, i.e., words not found in the dictionary.
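
The article does not include code for this computation; below is a minimal dynamic-programming sketch of the Levenshtein distance with unit costs for insertion, deletion and substitution (adding the ‘switch’ edit listed above would give the Damerau-Levenshtein variant). The function name edit_distance is a hypothetical helper used for illustration only.

def edit_distance(source, target):
    m, n = len(source), len(target)
    # dp[i][j] holds the edit distance between source[:i] and target[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                          # delete every character of source
    for j in range(n + 1):
        dp[0][j] = j                          # insert every character of target
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[m][n]

print(edit_distance("walkng", "walking"))  # 1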

Figure 1:


Figure 2:


Candidates filtering

Here we want to keep only correctly spelled real words from our generated candidate list, so we compare the candidates to a known dictionary (as we did in the first step) and filter out those that do not appear in it. Every surviving word is added to the candidate list. We repeat the edit procedure a second time to obtain candidates at a larger edit distance (for two-error cases). Each candidate is then scored with a unigram language model: frequencies for every vocabulary word are pre-calculated from a large text collection, and the candidate with the highest frequency is taken as the answer. This is known as Norvig’s approach.
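
As a rough illustration, the sketch below follows Norvig’s published spelling corrector: edits1 generates all strings one edit away using the four edit types listed earlier, edits2 applies the procedure twice, and correct keeps only known candidates and picks the one with the highest unigram frequency. The helper names and the word_counts frequency dictionary are illustrative assumptions, not the article’s exact implementation.

import string

letters = string.ascii_lowercase

def edits1(word):
    # All strings one edit (delete, switch, replace, insert) away from word.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    switches = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + switches + replaces + inserts)

def edits2(word):
    # All strings two edits away (a second pass over edits1).
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

def known(candidates, word_counts):
    # Keep only candidates that appear in the known vocabulary.
    return {w for w in candidates if w in word_counts}

def correct(word, word_counts):
    # Prefer the word itself, then one-edit, then two-edit candidates,
    # and return the candidate with the highest pre-calculated frequency.
    candidates = (known([word], word_counts)
                  or known(edits1(word), word_counts)
                  or known(edits2(word), word_counts)
                  or [word])
    return max(candidates, key=lambda w: word_counts.get(w, 0))

For example, with a toy frequency table such as {"walking": 50, "waking": 20}, correct("walkng", ...) returns "walking".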

Implementation/Discussion

The spell checker is implemented in Python using its web framework Django, with a front end built on Bootstrap, HTML, CSS and JavaScript, which makes the UI more dynamic and serves the spell-check results on the web page.

The website has three modules –

A. Word Check

B. Sentence Check

C. Document Check

Word check

This feature checks the spelling of a single word. Enter a word you want to correct, then press Spell Check to get the corrected word (Figure 3).

Figure 3:


Sentence check

This feature checks the spelling of a sentence. Enter a sentence you want to correct, then press Spell Check to get the corrected sentence, produced using the different pre-processing techniques (Figure 4).

Figure 4:


Document check

This feature checks the spelling of a whole document. Choose a document you want to correct, then press Upload; all the spelling errors in the document will be corrected (Figure 5).

Figure 5:


Result

Given raw textual data, the data is first pre-processed and the spell-check algorithm is then applied to it, which results in spell-checked data. The spell checker supports word check, sentence check and document check, and each of these applies the pre-processing steps and methodology described above.
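
Putting the pieces together, a hedged end-to-end sketch of the sentence-check flow might look like the following. It reuses the correct() helper and word_counts frequency table from the candidate-filtering sketch above; both are illustrative assumptions rather than the article’s exact code.

import nltk

def spell_check_sentence(sentence, word_counts):
    # Pre-processing: lower casing, word tokenization, punctuation removal.
    tokens = nltk.word_tokenize(sentence.lower())
    tokens = [t for t in tokens if t.isalnum()]
    # Correction: replace each token with its best-scoring candidate.
    return " ".join(correct(t, word_counts) for t in tokens)

# e.g. spell_check_sentence("I like to raed books.", word_counts)
# would return "i like to read books" with a suitable frequency table.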

Future Work

Future work includes making the spell check run in real time with lower time and space complexity, making the algorithm work on large data in a minimal amount of time, and using the probability of each corrected spelling to find the best-fit word for the sentence, even at the cost of a larger edit distance, thereby making the algorithm more reliable.

Conclusion

Hence, we can conclude that correct spelling is a fundamental element of communication. Implementing a correction algorithm based on edit distance (Levenshtein distance), in which the dictionary word with the minimum edit distance is selected, will definitely play a major role in spelling correction. A spell checker reduces typing work and helps avoid spelling errors while entering words.

Literature Survey

This paper elaborates on a Windows-based spell checker application that implements spell checking, augmented by data structures such as tries for storing the dictionary and an edit-distance algorithm for spelling correction. This provides the advantage of auto-suggestion, so that queries need not be typed in their entirety or remembered verbatim, and wrongly typed queries can be overridden without recomposing them [1].

The aim of this paper is to analyse the behaviour of text mining when a spell checker is integrated as an extra pre-process in the first stage. Different models were analysed, and the most complete one, which treats the pre-processes as the initial part of the text mining process, was chosen. Execution times of the algorithms were analysed to test the efficiency of the TM pre-processes in two variants, with and without the spell checker; time values were obtained directly when the algorithms were executed. Introducing the spell checker as an extra pre-process during the first TM stage produces a time reduction. The spelling correction benefits the TM stemming pre-process, but also IR and NLP processes, because those are part of the TM application areas [2].

NLTK, the Natural Language Toolkit, is a suite of Python modules providing many NLP data types, processing tasks, corpus samples and readers. The paper discusses its usefulness for teaching NLP and its wide array of functionalities, including tokenization, stemming, tagging, chunking and parsing. A significant fraction of any NLP course is made up of fundamental data structures and algorithms, which are usually taught with the help of formal notations and complex diagrams [3].

A major drawback of context-free spelling correction systems is that they return multiple solutions of equal importance. To remedy this, the authors introduce weights for the editing-operation errors in order to improve the ranking of solutions returned by the Levenshtein distance. The paper discusses at length how to resolve various issues in assigning these weights. From the results obtained, the authors note that their technique satisfactorily improves the ranking of solutions; however, they also note that their test corpus size is limited [4-7].

References

  1. Bhaire VV, Jadhav AA, Pashte PA, Magdum PG (2015) Spell checker. International Journal of Scientific and Research Publications 5(4): 5-7.
  2. Espino QJ, González RRM, Guevara LA (2018) Advantages of using a spell checker in text mining pre-processes. Journal of Computer and Communications 6(11): 43-54.
  3. Bird S (2006) NLTK: the natural language toolkit. In Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions pp. 69-72.
  4. Lhoussain AS, Hicham G, Abdellah YO (2015) Adaptating the levenshtein distance to contextual spelling correction. International Journal of Computer Science and Applications 12(1): 127-33.
  5. https://pypi.org/project/textdistance/
  6. https://towardsdatascience.com/autocorrect-8c33f3b472a0
  7. http://bytecontinnum.com/2017/06/natural-language-searches-lessons-spellcheck-autocorrect/

© 2022 Neelanarayanan V. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and building upon the work non-commercially.