How to get started with natural language processing _ natural language processing knowledge

1. What is NLP?

Natural language processing is an important direction in the field of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Therefore, research in this field will involve natural language, the language that people use every day, so it is closely related to the study of linguistics, but there are important differences. Natural language processing is not a general study of natural language, but rather a computer system that can effectively implement natural language communication, especially a software system therein. It is therefore part of computer science.

Natural language processing, which is to achieve natural language communication between human and computer, to achieve natural language understanding and natural language generation is very difficult. The root cause of the difficulties is the wide variety of ambiguities or ambiguities that exist widely in natural language texts and at all levels of dialogue. Communicating with computers in natural language is something that people have been pursuing for a long time. Because it has obvious practical significance, but also has important theoretical significance: people can use the computer in their most accustomed language, without spending a lot of time and energy to learn various computer languages ​​that are not very natural and habitual; It can also be used to further understand human language and intelligence mechanisms.

The ability model, usually a model based on linguistic rules, is based on the assumption that there is a general rule of grammar in the human brain. The language is derived from the language ability of the human brain. The language model is established by establishing a manually edited language. A set of rules to simulate this innate language ability. Also known as the "rationalist" language model.

Application models, specific language models built on different language processing applications, usually based on statistical models. Also known as the "empirical" language model, using large-scale real corpus to obtain statistical information on language units at all levels of language, using statistical statistical reasoning techniques based on statistical information in lower-level language units to calculate higher-level language units Statistics.

How to get started with natural language processing _ natural language processing knowledge

The basic structure of natural language processing: word segmentation => part of speech tag => Parser

1, participle

Words are the smallest meaningful language components that can be independently active. English words are spaces with natural delimiters, while Chinese is word-based writing units. There is no obvious distinction between words. Therefore, Chinese Word analysis is the basis and key of Chinese information processing.

Chinese word segmentation techniques can be divided into three categories: word segmentation methods based on dictionary and lexicon matching; word segmentation methods based on word frequency statistics and word segmentation methods based on knowledge understanding.

2. Part-of-Speech tagging or POS tagging, also known as word tagging or abbreviated tagging, refers to a procedure for labeling each word in a segmentation result with a correct part of speech, that is, determining that each word is a noun. , verb, adjective or other part of speech process. In Chinese, part-of-speech tagging is relatively simple, because the vocabulary of Chinese vocabulary is relatively rare. Most words have only one part of speech, or the most frequently occurring part of speech is much higher than the second part of speech. It is said that by selecting the highest frequency part of speech, the Chinese part-of-speech tagging program with 80% accuracy can be achieved. Use HMM to achieve higher accuracy word-of-speech tagging.

3. Name entity identification

Named EnTIty RecogniTIon (NER), also known as "name identification", refers to an entity that has a specific meaning in the text, including person name, place name, institution name, proper noun, and so on.

(1) entity boundary identification; (2) identification of entity category (person name, place name, institution name or other)

Named entity recognition is an important basic tool for information extraction, question and answer systems, syntax analysis, machine translation, and metadata annotation for SemanTIc Web.

Rule-based and dictionary-based approach (almost all participating members of the MUC-6 conference use a rule-based approach) that requires experts to develop rules with high accuracy, but relies on feature areas and poor portability;

Based on the statistical method, mainly using HMM, MEMM, and CRF, the difficulty lies in feature selection. This method can obtain good robustness and flexibility without much manual intervention and domain limitation, but requires a large number of annotation sets.

The hybrid method, which combines rules and statistics, and combines various statistical methods, is currently the mainstream method.

Features: Contextual Information + Word Formation

4, refers to the digestion

Refers to a common linguistic phenomenon. In general, there are two types of reference: anaphora and co-finger.

Anaphora means that the current anaphora has a close semantic relationship with the words, phrases or sentences (sentence groups) appearing above, and refers to the contextual semantics, which may refer to different entities in different locales. Asymmetric and non-transitive;

Co-referring mainly refers to two nouns (including pronouns, noun phrases) pointing to the same reference body in the real world, and this kind of referencing is still valid.

At present, the research on digestion is mainly focused on equivalence relations, and only considers whether two words or phrases indicate the same entity in the real world, that is, co-finger digestion.

There are three typical forms of Chinese referentials:

(1) Pronoun, for example: Li Ming is afraid of Gao Mama staying at home alone

Lonely, he moved the TV at home.

(2) Demonstrative pronouns (demonstraTIve), for example: Many people want to leave something for the child, which is understandable, but not completely correct.

(3) Definite description, for example: Trade sanctions have become the habitual stick of the US government to China. Is this big stick really as good as the US government hopes?

5, text classification

a text (the following basically does not distinguish between the meaning of the words "text" and "document")

A document is classified into one or several of a few predefined categories, and automatic classification of text is accomplished using a computer program.

6, question and answer system

The Question Answering System (QA) is an advanced form of information retrieval system that answers users' questions in natural language in an accurate and concise natural language.

According to the type of problem, it can be divided into two types: limited domain and open domain. According to the data type, it can be divided into: structural type and unstructured type (text). According to the answer type, it can be divided into two types: extractive and production.

Question Analysis - "Document Search -" Answer Extraction (Verification)

How to get started with natural language processing _ natural language processing knowledge

Natural Language Processing Toolkit:

The Chinese toolkit is LTP (Language Technology Platform) developed by HIT-SCIR (Harbin Institute of Technology Social Computing and Information Retrieval Research Center).

English (python):

Pattern - simpler to get started than NLTK

· chardet - character encoding detection

· pyenchant - easy access to dictionaries

· scikit-learn - has support for text classification

· unidecode - because ascii is much easier to deal with

Master the following tools:

CRF++

GIZA

Word2Vec


Natural language processing recommended study books

Nowadays, natural language processing relies on statistical knowledge. The following four standard books in the field of natural language processing are recommended.

"The Beauty of Mathematics", this writing is particularly popular and vivid, I believe that you will not feel dull

Statistical Learning Method

"Overview of Natural Language Processing"

Statistical Basics of Natural Language Processing

Natural Language Understanding

Servo Motor Replacement

Servo Motor Replacement Product Features:

1. Small footprint, high precision, high response, bus control, simple wiring, rich alarm functions, and lower cost;

2. All-copper coil, heat dissipation and waterproof; equipped with instantaneous power-off function to reduce downtime loss;

3. Customized motion control scheme: source manufacturers supply, strict quality inspection and guarantee; sufficient supply, short delivery time;

4. The speed can be adjusted arbitrarily within the rated speed according to the requirements of the computer program.

5. Optimize the electromagnetic scheme to improve the efficiency by more than 96%. Using superconducting materials, the temperature rise is reduced by 10%

6. The new movement has successfully realized the miniaturization of the motor.

Servo Motor Replacement,Robot Servo Motor,Digital Servo Motor,Perfect Pass Servo Motor Replacement

Kassel Machinery (zhejiang) Co., Ltd. , https://www.kasselservo.com