Writing a Web Crawler in Python on Linux (Part 1)

Posted: 2019-06-01 11:52:24

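Reference book: 《Python3 网络爬虫开发实战》 (first edition, April 2018)

System: Ubuntu 18.04.2 LTS

Background: Tesseract and the multi-language data pack tessdata are already installed.

Install command: pip3 install tesserocr pillow

Error output: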

Collecting tesserocr
Using cached https://files.pythonhosted.org/packages/92/2d/05a7f8387e93c192919b508e4f4936f232bd3d2ca388b9130ae538a9f9ad/tesserocr-2.4.0.tar.gz
Collecting pillow
Using cached https://files.pythonhosted.org/packages/d2/c2/f84b1e57416755e967236468dcfb0fad7fd911f707185efc4ba8834a1a94/Pillow-6.0.0-cp36-cp36m-manylinux1_x86_64.whl
Building wheels for collected packages: tesserocr
Running setup.py bdist_wheel for tesserocr ... error
Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-n7t6st2b/tesserocr/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmpn73hfamcpip-wheel- --python-tag cp36:
Supporting tesseract v4.0.0-beta.1
Configs from pkg-config: {'include_dirs': ['/usr/include'], 'libraries': ['lept', 'tesseract'], 'cython_compile_time_env': {'TESSERACT_VERSION': 60397825}}
/usr/lib/python3.6/distutils/dist.py:261: UserWarning: Unknown distribution option: 'long_description_content_type'
warnings.warn(msg)
running bdist_wheel
running build
running build_ext
building 'tesserocr' extension
creating build
creating build/temp.linux-x86_64-3.6
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include -I/usr/include/python3.6m -c tesserocr.cpp -o build/temp.linux-x86_64-3.6/tesserocr.o -std=c++11 -DUSE_STD_NAMESPACE
tesserocr.cpp:42:10: fatal error: Python.h: No such file or directory
#include "Python.h"
^~~~~~~~~~
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

----------------------------------------
Failed building wheel for tesserocr
Running setup.py clean for tesserocr
Failed to build tesserocr
Installing collected packages: tesserocr, pillow
Running setup.py install for tesserocr ... error
Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-n7t6st2b/tesserocr/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-7bsa_hbd-record/install-record.txt --single-version-externally-managed --compile --user --prefix=:
Supporting tesseract v4.0.0-beta.1
Configs from pkg-config: {'include_dirs': ['/usr/include'], 'libraries': ['lept', 'tesseract'], 'cython_compile_time_env': {'TESSERACT_VERSION': 60397825}}
/usr/lib/python3.6/distutils/dist.py:261: UserWarning: Unknown distribution option: 'long_description_content_type'
warnings.warn(msg)
running install
running build
running build_ext
building 'tesserocr' extension
creating build
creating build/temp.linux-x86_64-3.6
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include -I/usr/include/python3.6m -c tesserocr.cpp -o build/temp.linux-x86_64-3.6/tesserocr.o -std=c++11 -DUSE_STD_NAMESPACE
tesserocr.cpp:42:10: fatal error: Python.h: No such file or directory
#include "Python.h"
^~~~~~~~~~
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
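
The fatal error in this log is the compiler failing to find Python.h, the C header needed to build any Python extension module. A minimal sketch for checking whether that header is present for the running interpreter is shown here. It assumes a Debian/Ubuntu system, where the header normally comes from the python3-dev package; the function names are hypothetical helpers, not part of tesserocr or pip.

```python
import os
import sysconfig


def python_header_path():
    """Return the path where this interpreter expects Python.h to live."""
    include_dir = sysconfig.get_paths()["include"]
    return os.path.join(include_dir, "Python.h")


def has_python_headers():
    """True if Python.h exists, i.e. C extensions have a chance of compiling."""
    return os.path.isfile(python_header_path())


if __name__ == "__main__":
    path = python_header_path()
    if has_python_headers():
        print("found:", path)
    else:
        # Assumption: on Ubuntu the header is shipped by python3-dev.
        print("missing:", path)
        print("try: sudo apt install python3-dev")
```

If the script reports the header as missing, pip cannot compile tesserocr no matter how Tesseract itself was installed.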

 

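Solution: I switched to a different install command: sudo apt install tesseract-ocr

Note that the fatal error in the log, "Python.h: No such file or directory", means the Python development headers are missing. On Ubuntu they are provided by the python3-dev package (sudo apt install python3-dev), which pip needs before it can compile tesserocr.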

(PS: The difference from the command given in the book may be that this version is not the "Pillow-friendly" version.)

(PPS: Pillow is the friendly PIL fork by Alex Clark and Contributors. PIL is the Python Imaging Library by Fredrik Lundh and Contributors.)
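
Once the install goes through, a quick import check can confirm that both packages are usable. The following is a minimal sketch: ocr_stack_status is a hypothetical helper name, and the tesserocr calls only run when the module is actually importable. Note that Pillow is installed as pillow but imported as PIL.

```python
import importlib.util


def ocr_stack_status():
    """Map each import name to whether it can be found on this interpreter."""
    # Pillow is installed with "pip3 install pillow" but imported as "PIL".
    names = ("tesserocr", "PIL")
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    status = ocr_stack_status()
    print(status)
    if status["tesserocr"]:
        import tesserocr

        # Version of the Tesseract library that tesserocr was built against.
        print(tesserocr.tesseract_version())
        # Returns the tessdata path and the list of installed languages.
        print(tesserocr.get_languages())
```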

 

The original text reads:

Linux
To install Tesseract 4.x you can simply run the following command on your Ubuntu 18.xx bionic:

sudo apt install tesseract-ocr
If you wish to install the Developer Tools which can be used for training, run the following command:

sudo apt install libtesseract-dev
The following instructions are for building on Linux, which also can be applied to other UNIX like operating systems.

Dependencies
A compiler for C and C++: GCC or Clang
GNU Autotools: autoconf, automake, libtool
pkg-config
Leptonica
libpng, libjpeg, libtiff

Ubuntu
If they are not already installed, you need the following libraries (Ubuntu 16.04/14.04):

sudo apt-get install g++ # or clang++ (presumably)
sudo apt-get install autoconf automake libtool
sudo apt-get install pkg-config
sudo apt-get install libpng-dev
sudo apt-get install libjpeg8-dev
sudo apt-get install libtiff5-dev
sudo apt-get install zlib1g-dev
If you plan to install the training tools, you also need the following libraries:

sudo apt-get install libicu-dev
sudo apt-get install libpango1.0-dev
sudo apt-get install libcairo2-dev
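
Before building from source, it can help to see which of the build tools listed above are already on the PATH. This is a minimal sketch using only the standard library; BUILD_TOOLS and missing_tools are hypothetical names chosen for illustration. It checks for libtoolize rather than libtool, on the assumption that the Ubuntu libtool package installs that command. It only checks executables, so the -dev library packages still need to be verified separately.

```python
import shutil

# Executables corresponding to the build tools named in the dependency list.
# Assumption: Ubuntu's libtool package provides "libtoolize".
BUILD_TOOLS = ["g++", "autoconf", "automake", "libtoolize", "pkg-config"]


def missing_tools(tools=BUILD_TOOLS):
    """Return the tools from the list that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("not found on PATH:", ", ".join(missing))
    else:
        print("all build tools found")
```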

 

Source: https://github.com/tesseract-ocr/tesseract/wiki/Compiling

Original post: https://www.cnblogs.com/chowkaiyat/p/10958834.html