Artificial Intelligence (AI) - A broad set of computer science technologies that enables machines to simulate human behavior.
Corpus - A collection of written or spoken text. A key component of NLP research, often used to train AI and ML.
HTML Character Entities - Codes that stand in for reserved or hard-to-type characters in HTML, like &nbsp; (non-breaking space) and &amp; (ampersand).
HTML Tags - Building blocks of HTML, such as <b></b> to bold text.
Large Language Model (LLM) - A specific type of ML model designed for NLP tasks like understanding and generating human language.
Machine Learning (ML) - An application of AI that allows machines to learn from data using patterns and inference.
Named Entities - Real-world items mentioned in a text that fall into predefined categories, like names, identification numbers, addresses, and emails.
Natural Language Processing (NLP) - A subfield of computer science that uses ML to analyze and interpret natural language.
Personally Identifiable Information (PII) - Information that can be used, alone or combined with other data, to identify a specific person.
Regular Expressions - Sequences of characters that describe search patterns for matching text.
Stop Words - Common, inconsequential words like "a", "of", and "it" that are filtered out from a corpus.
Tokenization - Breaking a larger text down into smaller units like sentences, words, or even syllables to facilitate analysis.
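
Several of the terms above (HTML tags, HTML character entities, regular expressions, stop words, and tokenization) often appear together when a corpus is cleaned before analysis. The following is a minimal sketch of how they might fit together, not a definitive implementation. It assumes Python's standard library only; the function name preprocess, the tiny stop-word list, the [EMAIL] placeholder, and the simplified patterns are all hypothetical choices made for illustration.

```python
import html
import re
from typing import List

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "of", "it", "to", "is", "at"}

# Regular expressions (all simplified assumptions, not production patterns).
TAG_RE = re.compile(r"<[^>]+>")                          # naive HTML tag matcher
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")   # rough email shape
WORD_RE = re.compile(r"[A-Za-z0-9']+|\[EMAIL\]")         # word-level tokens


def preprocess(raw: str) -> List[str]:
    """Clean a raw HTML snippet and return its word tokens without stop words."""
    # 1. Strip HTML tags such as <b></b>.
    text = TAG_RE.sub("", raw)
    # 2. Decode HTML character entities (&amp; becomes &, &nbsp; becomes U+00A0).
    text = html.unescape(text)
    # 3. Treat the non-breaking space as an ordinary space.
    text = text.replace("\xa0", " ")
    # 4. Mask one kind of named entity / PII (email addresses) with a placeholder.
    text = EMAIL_RE.sub("[EMAIL]", text)
    # 5. Tokenize into words; punctuation such as "&" is dropped.
    tokens = WORD_RE.findall(text)
    # 6. Filter out stop words, ignoring case.
    return [t for t in tokens if t.lower() not in STOP_WORDS]


if __name__ == "__main__":
    sample = "<b>Send</b>&nbsp;it to jane.doe@example.com &amp; wait"
    print(preprocess(sample))
```

Tags are stripped before entities are decoded so that escaped text such as &lt;b&gt; is not mistaken for a real tag. For real corpora, an HTML parser and a dedicated NLP library would likely be more robust than these hand-written patterns.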