Notes on the official spaCy course

Contents: Introduction to spaCy · The nlp object · The Doc object · The Token object · The Span object · Lexical attributes · Exercises (a few examples) · Getting started · Documents, spans and tokens · Step 1 · Step 2 · Lexical attributes · Statistical models · What statistical models are · Model packages · Predicting part-of-speech tags · Predicting syntactic dependencies · Predicting named entities · Exercises · Model packages · Loading models · Predicting linguistic annotations · Part 1 · Part 2 · Predicting named entities in context · Rule-based matching · Why not just regular expressions · Match patterns · Using the Matcher
```python
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()
```
- Contains the processing pipeline
- Includes language-specific tokenization rules, etc.
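As a quick sketch of those language-specific rules: even a blank English pipeline (no statistical model) already splits contractions and keeps abbreviations together, following spaCy's built-in tokenizer exceptions.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Let's go to N.Y.!")
# "Let's" is split into "Let" + "'s", while "N.Y." stays one token
print([token.text for token in doc])
```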
```python
# Created by processing a string of text with the nlp object
doc = nlp("Hello world!")

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)
```
```python
doc = nlp("Hello world!")

# Index into the Doc to get a single Token
token = doc[1]

# Get the token text via the .text attribute
print(token.text)
```
```python
doc = nlp("Hello world!")

# A slice from the Doc is a Span object
span = doc[1:3]

# Get the span text via the .text attribute
print(span.text)
```
```python
doc = nlp("It costs $5.")
print("Index:   ", [token.i for token in doc])
print("Text:    ", [token.text for token in doc])

print("is_alpha:", [token.is_alpha for token in doc])
print("is_punct:", [token.is_punct for token in doc])
print("like_num:", [token.like_num for token in doc])
```
Import the `English` class from `spacy.lang.en` and create the `nlp` object. Create a `doc` and print its text.
```python
from spacy.lang.en import English

nlp = English()
doc = nlp("This is a sentence.")
print(doc.text)
```
When you call `nlp` on a string, spaCy first tokenizes the text and creates a document object.
Import the `English` language class and create the `nlp` object. Process the text and instantiate a `Doc` object in the variable `doc`. Select the first token of the `Doc` and print its `text`.
```python
# Import the English language class and create the nlp object
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# Select the first token
first_token = doc[0]

# Print the first token's text
print(first_token.text)
```
Import the `English` language class and create the `nlp` object. Process the text and instantiate a `Doc` object in the variable `doc`. Create slices of the `Doc` for the tokens "tree kangaroos" and "tree kangaroos and narwhals".
```python
# Import the English language class and create the nlp object
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals.text)
```
In this example, you will use spaCy's `Doc` and `Token` objects together with lexical attributes to find percentages in a text. You will look for two subsequent tokens: a number and a percent sign.

Use the `like_num` token attribute to check whether a token in the `doc` resembles a number. Get the token following the current token in the document; the index of the next token in `doc` is `token.i + 1`. Check whether the next token's `text` attribute is a percent sign "%".
```python
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals "%"
        if next_token.text == "%":
            print("Percentage found:", token.text)
```
Statistical models:

- Enable spaCy to predict linguistic attributes in context:
  - part-of-speech tags
  - syntactic dependencies
  - named entities
- Are trained on labeled example texts
- Can be updated with more examples to fine-tune predictions
```shell
$ python -m spacy download en_core_web_sm
```
```python
import spacy

nlp = spacy.load("en_core_web_sm")
```
A model package contains:

- Binary weights
- Vocabulary
- Meta information (language, pipeline)
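The meta information is accessible on any pipeline via `nlp.meta`. A minimal sketch using a blank `English()` object, which needs no model download (its pipeline list is simply empty):

```python
from spacy.lang.en import English

nlp = English()
# The meta dict stores the language code, pipeline components, version, etc.
print(nlp.meta["lang"])      # "en"
print(nlp.meta["pipeline"])  # [] for a blank pipeline
```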
```python
import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Process a text
doc = nlp("She ate the pizza")

# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)
```
```python
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
```
| Label | Description | Example |
|---|---|---|
| nsubj | nominal subject | She |
| dobj | direct object | pizza |
| det | determiner (article) | the |
```python
# Process a text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
```
Tip: the `spacy.explain` method gives quick definitions of common tags and labels.

```python
spacy.explain("GPE")
# 'Countries, cities, states'
spacy.explain("NNP")
# 'noun, proper singular'
spacy.explain("dobj")
# 'direct object'
```
Which of the following is NOT included in a model package you can load into spaCy?

A. A meta file including the language, pipeline and license.
B. Binary weights to make statistical predictions.
C. The labeled data the model was trained on.
D. Strings of the model's vocabulary and their hashes.
Use `spacy.load` to load the small English model `"en_core_web_sm"`. Process the text and print the document text.
```python
import spacy

# Load the "en_core_web_sm" model
nlp = spacy.load("en_core_web_sm")

text = "It's official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Print the document text
print(doc.text)
```
Now you will try one of spaCy's pre-trained model packages and see its predictions in action. Feel free to try it on your own text! To find out what a tag or label means, you can call `spacy.explain` in the loop, for example `spacy.explain("PROPN")` or `spacy.explain("GPE")`.
Process the text with the `nlp` object and create a `doc`. For each token, print the token text, the token's `.pos_` (part-of-speech tag) and the token's `.dep_` (dependency label).
```python
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It's official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}")
```
Process the text and create a `doc` object. Iterate over `doc.ents` and print the entity text and the `label_` attribute.
```python
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It's official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
```
Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you are processing. Let's look at an example.

Process the text with the `nlp` object. Iterate over the entities and print the entity text and label. It looks like the model didn't predict "iPhone X", so create a span for those tokens manually.
```python
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)
```
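If you want the hand-made span to carry an entity label, you can also build a `Span` with a label and overwrite `doc.ents`. A sketch using a blank pipeline; the `GADGET` label here is made up for illustration:

```python
from spacy.lang.en import English
from spacy.tokens import Span

nlp = English()
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Create a labeled Span for tokens 1-2 and register it as an entity
iphone_x = Span(doc, 1, 3, label="GADGET")
doc.ents = [iphone_x]

print([(ent.text, ent.label_) for ent in doc.ents])
```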
Why use the Matcher instead of just regular expressions?

- It matches on `Doc` objects, not just strings
- It matches on tokens and token attributes
- It can use a model's predictions
- Example: "duck" (verb) vs. "duck" (noun)

Match patterns are lists of dictionaries, one per token:

- Match exact token text: `[{"TEXT": "iPhone"}, {"TEXT": "X"}]`
- Match lexical attributes: `[{"LOWER": "iphone"}, {"LOWER": "x"}]`
- Match any token attribute: `[{"LEMMA": "buy"}, {"POS": "NOUN"}]`
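A minimal sketch of the lexical-attribute variant: matching on `LOWER` makes the pattern case-insensitive, and it works even with a blank pipeline, since no statistical predictions are needed:

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)
# One pattern, two token dicts: both tokens matched by lowercase form
matcher.add("IPHONE", [[{"LOWER": "iphone"}, {"LOWER": "x"}]])

doc = nlp("New IPHONE X and iphone x models leaked")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```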
```python
import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load("en_core_web_sm")

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_PATTERN", [pattern])

# Process some text
doc = nlp("Upcoming iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)
```
```python
import spacy

# Import the spaCy Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

# Create the nlp object
# doc = nlp("I love little sheep.")
# doc2 = nlp("Indians spent over $71 billion on clothes in 2018")

# # The spaCy processing pipeline
# print(nlp.pipe_names)

# # Part-of-speech tagging
# for token in doc:
#     # Print the token and its part-of-speech tag
#     print(token.text, "-->", token.pos_)
# print(spacy.explain("PUNCT"))

# # Dependency parsing
# for token in doc:
#     print(token.text, "-->", token.dep_)

# # Named entity recognition
# for ent in doc2.ents:
#     print(ent.text, ent.label_)

# Rule-based matching with spaCy

# Initialize the Matcher with the spaCy vocabulary
matcher = Matcher(nlp.vocab)

doc = nlp("Some people start their day with lemon water")

# Define the pattern
pattern = [{"TEXT": "lemon"}, {"TEXT": "water"}]

# Add the pattern to the matcher
matcher.add("rule_1", [pattern])
matches = matcher(doc)
print(matches)
```
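Patterns can also use quantifiers via the `"OP"` key. A quick sketch (the rule name is made up) where `"OP": "?"` makes the second token optional, so both "lemon" alone and "lemon water" match:

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)

# "?" makes "water" optional: matches "lemon" and "lemon water"
pattern = [{"LOWER": "lemon"}, {"LOWER": "water", "OP": "?"}]
matcher.add("LEMON", [pattern])

doc = nlp("lemon water and a lemon tart")
print([doc[start:end].text for _, start, end in matcher(doc)])
```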
Source: https://www.cnblogs.com/H-saku/p/14794622.html