Notes on the official spaCy course

Contents: Introduction to spaCy · The nlp object · The Doc object · The Token object · The Span object · Lexical attributes · Exercises (a few examples) · Getting started · Documents, spans and tokens · Step 1 · Step 2 · Lexical attributes · Statistical models · What statistical models are · Model packages · Predicting part-of-speech tags · Predicting syntactic dependencies · Predicting named entities · Exercises · Model packages · Loading models · Predicting linguistic annotations · Part 1 · Part 2 · Predicting named entities in context · Rule-based matching · Why not just regular expressions · Match patterns · Using the Matcher
```python
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()
```
- Contains the processing pipeline
- Includes language-specific tokenization rules, etc.
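As a quick sketch of those language-specific rules: even a blank English pipeline (no statistical model) already splits contractions and keeps abbreviations together, following spaCy's built-in tokenizer exceptions.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Let's go to N.Y.!")
# "Let's" is split into "Let" + "'s", while "N.Y." stays one token
print([token.text for token in doc])
```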
```python
# Created by processing a string of text with the nlp object
doc = nlp("Hello world!")

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)
```
```python
doc = nlp("Hello world!")

# Index into the Doc to get a single Token
token = doc[1]

# Get the token text via the .text attribute
print(token.text)
```
```python
doc = nlp("Hello world!")

# A slice from the Doc is a Span object
span = doc[1:3]

# Get the span text via the .text attribute
print(span.text)
```
```python
doc = nlp("It costs $5.")
print("Index:   ", [token.i for token in doc])
print("Text:    ", [token.text for token in doc])

print("is_alpha:", [token.is_alpha for token in doc])
print("is_punct:", [token.is_punct for token in doc])
print("like_num:", [token.like_num for token in doc])
```
Import the `English` class from `spacy.lang.en` and create the `nlp` object. Create a `doc` and print its text.
```python
from spacy.lang.en import English

nlp = English()
doc = nlp("This is a sentence.")
print(doc.text)
```
When you call `nlp` on a string, spaCy first tokenizes the text and creates a document object.
Import the `English` language class and create the `nlp` object. Process the text and instantiate a `Doc` object in the variable `doc`. Select the first token of the `Doc` and print its `text`.
```python
# Import the English language class and create the nlp object
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# Select the first token
first_token = doc[0]

# Print the first token's text
print(first_token.text)
```
Import the `English` language class and create the `nlp` object. Process the text and instantiate a `Doc` object in the variable `doc`. Create slices of the `Doc` for the tokens "tree kangaroos" and "tree kangaroos and narwhals".
```python
# Import the English language class and create the nlp object
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals.text)
```
In this example, you will use spaCy's `Doc` and `Token` objects together with lexical attributes to find percentages in a text. You will look for two subsequent tokens: a number and a percent sign.

Use the `like_num` token attribute to check whether a token in the `doc` resembles a number. Get the token following the current token in the document; the index of the next token in `doc` is `token.i + 1`. Check whether the next token's `text` attribute is a percent sign "%".
```python
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals "%"
        if next_token.text == "%":
            print("Percentage found:", token.text)
```
Statistical models:

- Enable spaCy to predict linguistic attributes in context:
  - part-of-speech tags
  - syntactic dependencies
  - named entities
- Are trained on labeled example texts
- Can be updated with more examples to fine-tune predictions
```shell
$ python -m spacy download en_core_web_sm
```
```python
import spacy

nlp = spacy.load("en_core_web_sm")
```
A model package contains:

- Binary weights
- Vocabulary
- Meta information (language, pipeline)
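The meta information is accessible on any pipeline via `nlp.meta`. A minimal sketch using a blank `English()` object, which needs no model download (its pipeline list is simply empty):

```python
from spacy.lang.en import English

nlp = English()
# The meta dict stores the language code, pipeline components, version, etc.
print(nlp.meta["lang"])      # "en"
print(nlp.meta["pipeline"])  # [] for a blank pipeline
```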
```python
import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Process a text
doc = nlp("She ate the pizza")

# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)
```
```python
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
```
| Label | Description | Example |
|---|---|---|
| nsubj | nominal subject | She |
| dobj | direct object | pizza |
| det | determiner (article) | the |
```python
# Process a text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
```
Tip: the `spacy.explain` method gives quick definitions of common tags and labels.

```python
spacy.explain("GPE")
# 'Countries, cities, states'
spacy.explain("NNP")
# 'noun, proper singular'
spacy.explain("dobj")
# 'direct object'
```
Which of the following is NOT included in a model package you can load into spaCy?

A. A meta file including the language, pipeline and license.
B. Binary weights to make statistical predictions.
C. The labeled data the model was trained on.
D. Strings of the model's vocabulary and their hashes.
Use `spacy.load` to load the small English model `"en_core_web_sm"`. Process the text and print the document text.
```python
import spacy

# Load the "en_core_web_sm" model
nlp = spacy.load("en_core_web_sm")

text = "It's official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Print the document text
print(doc.text)
```
Now you will try one of spaCy's pre-trained model packages and see its predictions in action. Feel free to try it on your own text! To find out what a tag or label means, you can call `spacy.explain` in the loop, for example `spacy.explain("PROPN")` or `spacy.explain("GPE")`.
Process the text with the `nlp` object and create a `doc`. For each token, print the token text, the token's `.pos_` (part-of-speech tag) and the token's `.dep_` (dependency label).
```python
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It's official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}")
```
Process the text and create a `doc` object. Iterate over `doc.ents` and print the entity text and the `label_` attribute.
```python
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It's official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
```
Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you are processing. Let's look at an example.

Process the text with the `nlp` object. Iterate over the entities and print the entity text and label. It looks like the model didn't predict "iPhone X", so create a span for those tokens manually.
```python
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)
```
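If you want the hand-made span to carry an entity label, you can also build a `Span` with a label and overwrite `doc.ents`. A sketch using a blank pipeline; the `GADGET` label here is made up for illustration:

```python
from spacy.lang.en import English
from spacy.tokens import Span

nlp = English()
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Create a labeled Span for tokens 1-2 and register it as an entity
iphone_x = Span(doc, 1, 3, label="GADGET")
doc.ents = [iphone_x]

print([(ent.text, ent.label_) for ent in doc.ents])
```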
Why use the Matcher instead of just regular expressions?

- It matches on `Doc` objects, not just strings
- It matches on tokens and token attributes
- It can use a model's predictions
- Example: "duck" (verb) vs. "duck" (noun)

Match patterns are lists of dictionaries, one per token:

- Match exact token text: `[{"TEXT": "iPhone"}, {"TEXT": "X"}]`
- Match lexical attributes: `[{"LOWER": "iphone"}, {"LOWER": "x"}]`
- Match any token attribute: `[{"LEMMA": "buy"}, {"POS": "NOUN"}]`
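A minimal sketch of the lexical-attribute variant: matching on `LOWER` makes the pattern case-insensitive, and it works even with a blank pipeline, since no statistical predictions are needed:

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)
# One pattern, two token dicts: both tokens matched by lowercase form
matcher.add("IPHONE", [[{"LOWER": "iphone"}, {"LOWER": "x"}]])

doc = nlp("New IPHONE X and iphone x models leaked")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```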
```python
import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load("en_core_web_sm")

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_PATTERN", [pattern])

# Process some text
doc = nlp("Upcoming iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)
```
```python
import spacy

# Import the spaCy Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

# Create the nlp object
# doc = nlp("I love little sheep.")
# doc2 = nlp("Indians spent over $71 billion on clothes in 2018")

# # The spaCy processing pipeline
# print(nlp.pipe_names)

# # Part-of-speech tagging
# for token in doc:
#     # Print the token and its part-of-speech tag
#     print(token.text, "-->", token.pos_)
# print(spacy.explain("PUNCT"))

# # Dependency parsing
# for token in doc:
#     print(token.text, "-->", token.dep_)

# # Named entity recognition
# for ent in doc2.ents:
#     print(ent.text, ent.label_)

# Rule-based matching with spaCy

# Initialize the Matcher with the spaCy vocabulary
matcher = Matcher(nlp.vocab)

doc = nlp("Some people start their day with lemon water")

# Define the pattern
pattern = [{"TEXT": "lemon"}, {"TEXT": "water"}]

# Add the pattern to the matcher
matcher.add("rule_1", [pattern])
matches = matcher(doc)
print(matches)
```
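Patterns can also use quantifiers via the `"OP"` key. A quick sketch (the rule name is made up) where `"OP": "?"` makes the second token optional, so both "lemon" alone and "lemon water" match:

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)

# "?" makes "water" optional: matches "lemon" and "lemon water"
pattern = [{"LOWER": "lemon"}, {"LOWER": "water", "OP": "?"}]
matcher.add("LEMON", [pattern])

doc = nlp("lemon water and a lemon tart")
print([doc[start:end].text for _, start, end in matcher(doc)])
```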
Source: https://www.cnblogs.com/H-saku/p/14794622.html