ETL: An Introduction to petl

Posted: 2021-02-28 21:48:02

Introduction to petl

petl is an ETL package written in pure Python. Its data-manipulation logic is simple, but it processes data relatively slowly.

ETL pipelines

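petl makes extensive use of iterators and lazy evaluation. A pipeline does not start processing any data until a function that actually requests the data is called.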

import petl as etl
table1 = etl.fromcsv('example.csv')
table2 = etl.convert(table1, 'foo', 'upper')
table3 = etl.convert(table2, 'bar', int)
table4 = etl.convert(table3, 'baz', float)
table5 = etl.addfield(table4, 'quux', lambda row: row.bar * row.baz)

petl.util.vis.look(), petl.io.csv.tocsv(), petl.io.text.totext(), petl.io.sqlite3.tosqlite3() and petl.io.db.todb() are all functions that request data. Only when one of them is called is the data processed, step by step, through the pipeline.

etl.look(table5)
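
The lazy behaviour described above can be illustrated without petl at all. The sketch below uses plain Python generators; it is not petl's implementation, and the `convert` helper and the `calls` log are hypothetical names introduced only for this illustration. It shows that building the chain does no work and that consuming the last step drives every earlier one.

```python
# Illustration only: plain generators, not petl's internals.
calls = []  # records every converted value so we can see when work happens


def convert(rows, index, func):
    """Lazily apply func to one column; the header row passes through unchanged."""
    it = iter(rows)
    yield next(it)  # header row
    for row in it:
        calls.append(row[index])
        new_row = list(row)
        new_row[index] = func(row[index])
        yield tuple(new_row)


source = [("foo", "bar"), ("a", "1"), ("b", "2")]
step1 = convert(source, 0, str.upper)  # nothing runs yet
step2 = convert(step1, 1, int)         # still nothing runs

work_before = len(calls)  # no rows have been converted at this point
result = list(step2)      # requesting the data pulls rows through both steps
work_after = len(calls)

print(work_before, work_after)
print(result)
```

Each row flows through both steps before the next row is read, which is why a pipeline of this kind can handle input that does not fit in memory.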

Object-oriented programming

petl supports both a functional and an object-oriented programming style. For example:

>>> import petl as etl
>>> table = (
...     etl
...     .fromcsv('example.csv')
...     .convert('foo', 'upper')
...     .convert('bar', int)
...     .convert('baz', float)
...     .addfield('quux', lambda row: row.bar * row.baz)
... )
>>> table.look()

petl's wrap() function turns any valid table container into a table object that supports this method-chaining style:

>>> l = [['foo', 'bar'], ['a', 1], ['b', 2], ['c', 2]]
>>> table = etl.wrap(l)
>>> table.look()
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' |   1 |
+-----+-----+
| 'b' |   2 |
+-----+-----+
| 'c' |   2 |
+-----+-----+

Interactive use

In an interactive Python session, the default representation of a table object is produced by petl.util.vis.look():

>>> l = [['foo', 'bar'], ['a', 1], ['b', 2], ['c', 2]]
>>> table = etl.wrap(l)
>>> table
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' |   1 |
+-----+-----+
| 'b' |   2 |
+-----+-----+
| 'c' |   2 |
+-----+-----+

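By default, data values are rendered with repr(), so strings are shown in quotes. To see the str() form of the values instead, print the table with print():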

>>> print(table)
+-----+-----+
| foo | bar |
+=====+=====+
| a   |   1 |
+-----+-----+
| b   |   2 |
+-----+-----+
| c   |   2 |
+-----+-----+
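
The difference between the two displays comes down to Python's built-in repr() and str(). A minimal sketch, using only the standard library and the same kind of values as the table above:

```python
values = ["a", 1, 2.5]

as_repr = [repr(v) for v in values]  # strings keep their quotes
as_str = [str(v) for v in values]    # strings lose their quotes

print(as_repr)
print(as_str)
```

The repr() form makes it easy to tell the string '1' apart from the integer 1, which is useful when checking whether a conversion step has been applied.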

Running the petl script from the command line

$ petl "dummytable().tocsv()" > example.csv
$ cat example.csv | petl "fromcsv().cut('foo', 'baz').convert('baz', float).selectgt('baz', 0.5).head().data().totsv()"

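The petl script expects a single positional argument, which is evaluated as Python code with the petl namespace imported. In the second command, fromcsv() is called without a file name, so it reads the contents of example.csv from standard input.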

Table containers and table iterators

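A table container is any object that satisfies the following:

1. It implements the __iter__ method.
2. __iter__ returns a table iterator.
3. All table iterators returned by __iter__ are independent; that is, consuming items from one iterator does not affect any other iterator.

A table iterator is an iterator that satisfies the following:

4. Each item returned by the iterator is a sequence (for example, a tuple or a list).
5. The first item returned by the iterator is a header row, made up of a sequence of header values.
6. Each subsequent item returned by the iterator is a data row, made up of a sequence of data values.
7. A header value is typically a string (str), but it may be an object of any type as long as it implements __str__ and is pickleable.
8. A data value is any pickleable object.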

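For example:

>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]

Here table is a valid table container: a list implements __iter__, which returns an iterator whose first item is the header row ['foo', 'bar'] and whose subsequent items are the data rows ['a', 1] and ['b', 2].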

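The main reason for requiring table containers to support independent table iterators (point 3) is that data from a table may need to be iterated over several times within the same program or interactive session. For example, when building up a sequence of transformation steps with petl in an interactive session, the user may want to inspect the output of several intermediate steps before all the steps are defined and the transformation is run in full.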
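
The independence requirement can be checked with plain Python. The sketch below contrasts a list, which hands out a fresh iterator on every iter() call, with a generator, which is its own iterator and therefore shares state. The `rows` generator function is a hypothetical name used only for this illustration.

```python
table = [["foo", "bar"], ["a", 1], ["b", 2]]

# A list is a valid table container: each iter() call starts from the top.
it1 = iter(table)
it2 = iter(table)
next(it1)                    # it1 consumes the header row
next(it1)                    # it1 consumes the first data row
header_from_it2 = next(it2)  # it2 is unaffected and still yields the header


# A generator object is NOT a valid table container: iter() returns the
# generator itself, so every "iterator" shares the same position.
def rows():
    yield ["foo", "bar"]
    yield ["a", 1]
    yield ["b", 2]


gen = rows()
g1 = iter(gen)
g2 = iter(gen)
next(g1)                   # consumes the header row
first_from_g2 = next(g2)   # the header is already gone

print(header_from_it2)
print(first_from_g2, g1 is g2)
```

A generator-based source can still be made into a valid container by wrapping it in a class whose __iter__ creates a new generator on each call, which is the pattern the ArrayView example in the next section follows.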

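Extending petl: integrating custom data sources

The petl.io module provides functions for extracting data from many well-known data sources. Writing an extension to integrate another data source is also straightforward: for an object to be usable as a petl table, it has to implement the table container convention described above. Below is the source code of the ArrayView class, which integrates petl with numpy arrays. The class is included in the petl.io.numpy module, but it also serves as an example of how to integrate other data sources: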

>>> import petl as etl
>>> class ArrayView(etl.Table):
...     def __init__(self, a):
...         # assume that a is a numpy array
...         self.a = a
...     def __iter__(self):  # the container implements __iter__
...         # yield the header row
...         header = tuple(self.a.dtype.names)  # the first item is the header row
...         yield header  # yield makes this a generator, so each call returns a new iterator
...         # yield the data rows
...         for row in self.a:
...             yield tuple(row)  # each subsequent item is a data row

This class allows numpy arrays to be used with petl functions:

>>> import numpy as np
>>> a = np.array([('apples', 1, 2.5),
...               ('oranges', 3, 4.4),
...               ('pears', 7, 0.1)],
...              dtype='U8, i4,f4')
>>> t1 = ArrayView(a)
>>> t1
+-----------+----+-----------+
| f0        | f1 | f2        |
+===========+====+===========+
| 'apples'  | 1  | 2.5       |
+-----------+----+-----------+
| 'oranges' | 3  | 4.4000001 |
+-----------+----+-----------+
| 'pears'   | 7  | 0.1       |
+-----------+----+-----------+

>>> t2 = t1.cut('f0', 'f2').convert('f0', 'upper').addfield('f3', lambda row: row.f2 * 2)
>>> t2
+-----------+-----------+---------------------+
| f0        | f2        | f3                  |
+===========+===========+=====================+
| 'APPLES'  | 2.5       |                 5.0 |
+-----------+-----------+---------------------+
| 'ORANGES' | 4.4000001 |  8.8000001907348633 |
+-----------+-----------+---------------------+
| 'PEARS'   | 0.1       | 0.20000000298023224 |
+-----------+-----------+---------------------+

As long as t1 satisfies the table container convention, it can be used with petl's functions and pipelines.
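
The same pattern carries over to other sources. The sketch below is a hypothetical `DictsView` class over a list of dictionaries, written in plain Python so that it runs without petl; the class name, its `fields` parameter and the sample records are all assumptions made for illustration and are not part of petl. Unlike ArrayView it does not subclass etl.Table, so it only demonstrates the container convention itself, not the method-chaining style.

```python
class DictsView:
    """Hypothetical table container over a list of dicts (illustration only)."""

    def __init__(self, records, fields):
        self.records = records
        self.fields = tuple(fields)

    def __iter__(self):
        # A new generator is created on every call, so iterators are independent.
        yield self.fields  # header row
        for rec in self.records:
            # One data row per record; missing keys become None.
            yield tuple(rec.get(f) for f in self.fields)


records = [{"name": "apples", "qty": 1}, {"name": "pears", "qty": 7}]
view = DictsView(records, ["name", "qty"])

first_pass = list(view)
second_pass = list(view)  # iterating again starts from the header

print(first_pass)
print(first_pass == second_pass)
```

Because such an object follows the convention, it should be usable wherever petl expects a table, for example as the argument to the wrap() function shown earlier.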

Original: https://www.cnblogs.com/wry789/p/14459977.html
