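Puppeteer in Practice: Notes and Analysis

# Installation and configuration

# Introduction to Puppeteer

# Puppeteer, Headless Chrome, and Node.js

Puppeteer is a Node.js library that provides a high-level API for controlling Chrome or Chromium over the DevTools Protocol. It runs the browser headless by default, but can be configured to run full (non-headless) Chrome or Chromium. In other words, it gives you a browser that you can drive from a Node.js program. A Chinese translation of the official documentation is also available.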

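Note: when you install Puppeteer, it downloads a recent version of Chromium (~170MB Mac, ~282MB Linux, ~280MB Win) that is guaranteed to work with the API.

What Puppeteer can be used for

  • Generating PDFs and screenshots of pages
  • Crawling a SPA and generating pre-rendered content (i.e. "SSR", server-side rendering)
  • Scraping content from websites
  • Automating form submission, UI testing, keyboard input, and so on
  • Creating an up-to-date automated testing environment, so that test cases run directly in a recent version of Chrome
  • Capturing a timeline trace of your site to help diagnose performance problems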

# Simple examples

const puppeteer = require('puppeteer')

// Render a page and save it as an A4 PDF.
async function download() {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  // await page.goto('http://10.45.xx.xx:9001/report.html', { waitUntil: 'networkidle0'})
  // networkidle0: navigation counts as finished once there have been no network connections for at least 500 ms
  await page.goto('https://www.baidu.com', { waitUntil: 'networkidle0' })
  const pdf = await page.pdf({ path: './test.pdf', format: 'A4' })
  await browser.close()
  return pdf
}
download().catch(console.error)

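A quick walk through the code above:

  1. puppeteer.launch() creates a browser instance, the Browser object
  2. The Browser object creates a page, the Page object
  3. page.goto() navigates to the given URL
  4. page.pdf() generates the PDF file
  5. The browser is closed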
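One thing the five steps do not cover is what happens when a step in the middle throws: the browser process would be left running. The sketch below is one possible way to guarantee step 5. The helper name `withPage` is hypothetical (it is not part of Puppeteer), and the `launch` function is passed in by the caller so that the helper does not depend on how the browser is started.

```javascript
// Hypothetical helper, not a Puppeteer API: runs `task(page)` and always
// closes the browser afterwards, even when the task throws.
// `launch` is supplied by the caller, e.g. () => require('puppeteer').launch()
async function withPage(launch, task) {
  const browser = await launch();
  try {
    const page = await browser.newPage();
    return await task(page);
  } finally {
    // Runs on success and on failure, so no Chromium process is left behind.
    await browser.close();
  }
}
```

With Puppeteer installed, the PDF example could then be written as `withPage(() => require('puppeteer').launch(), async (page) => { await page.goto(url); return page.pdf({ format: 'A4' }); })`.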

Taking a screenshot follows the same pattern:

const puppeteer = require('puppeteer');
async function getPic() {
  // With {headless: false} you can watch a real Chrome window carry out your code.
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('https://baidu.com');
  // The page size can be changed with the following line
  await page.setViewport({width: 1000, height: 500})
  await page.screenshot({path: 'google.png'});
  await browser.close();
}
getPic();

With header and footer templates:

// app.js
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch({
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
    ignoreHTTPSErrors: true,
    headless: true
  });
  const page = await browser.newPage();
  // timeout is in milliseconds; 3000 is short for slow pages, raise it if navigation times out
  await page.goto('https://www.baidu.com', { timeout: 3000, waitUntil: 'networkidle2' });

  await page.pdf({
    path: 'example.pdf',
    format: 'A4',
    printBackground: true,
    preferCSSPageSize: true,
    displayHeaderFooter: true,
    margin: {
      top: '2cm',
      bottom: '2cm'
    },
    headerTemplate: `<div style="width:100%;text-align:right;margin-right: 20px;font-size:10px">Header</div>`,
    footerTemplate: `<div style="width:100%;text-align:right;margin-right: 20px;font-size:10px">Footer</div>`
  });

  await browser.close();
})()
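The templates above print fixed text. Chromium also fills in elements that carry certain class names inside headerTemplate and footerTemplate: date, title, url, pageNumber and totalPages. The following is a minimal sketch of a footer with page numbers; the helper name `buildFooterTemplate` and its two options are made up for this illustration.

```javascript
// Hypothetical helper: builds a footerTemplate string for page.pdf().
// Chromium injects the values into the elements whose class is
// "pageNumber" and "totalPages" when it prints each page.
function buildFooterTemplate({ align = 'right', fontSize = 10 } = {}) {
  return (
    `<div style="width:100%;text-align:${align};margin-right:20px;font-size:${fontSize}px">` +
    '<span class="pageNumber"></span> / <span class="totalPages"></span>' +
    '</div>'
  );
}

// Options in the same shape as the page.pdf() call above.
const pdfOptions = {
  path: 'example.pdf',
  format: 'A4',
  printBackground: true,
  displayHeaderFooter: true,
  margin: { top: '2cm', bottom: '2cm' },
  headerTemplate: '<span></span>', // intentionally empty header
  footerTemplate: buildFooterTemplate(),
};
```

`pdfOptions` can be passed straight to `page.pdf(pdfOptions)`. The margins matter here: the header and footer are drawn inside the page margins, so they need enough room to be visible.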

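# Environment setup

Since v1.7.0, Puppeteer also ships a separate puppeteer-core package. It contains only the core library and does not download Chromium by default.

Installing dependencies

npm i puppeteer-core
# If even puppeteer fails to install, try the Taobao npm mirror
npm config set registry https://registry.npm.taobao.org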

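If you downloaded Chromium yourself, add the following options when launching the headless browser:

this.browser = await puppeteer.launch({
  // On macOS this is typically "xxx/Chromium.app/Contents/MacOS/Chromium", on Linux "/usr/bin/chromium-browser"
  executablePath: "path to your Chromium installation",
  args: ['--no-sandbox', '--disable-dev-shm-usage'], // --no-sandbox disables the sandbox
});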
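Hard-coding the path makes the same code awkward to run on different machines. One possible approach, sketched below, is to read the path from an environment variable and fall back to the Linux path mentioned above. The variable name `CHROMIUM_PATH` and the function names are assumptions made for this example; puppeteer-core does not read this variable itself.

```javascript
// Sketch: choose the Chromium binary for puppeteer-core.
// CHROMIUM_PATH is a hypothetical variable read by this helper only.
function resolveLaunchOptions(env = process.env, platform = process.platform) {
  const executablePath =
    env.CHROMIUM_PATH ||
    (platform === 'linux' ? '/usr/bin/chromium-browser' : undefined);
  if (!executablePath) {
    throw new Error('Set CHROMIUM_PATH to the Chromium executable');
  }
  return {
    executablePath,
    args: ['--no-sandbox', '--disable-dev-shm-usage'],
  };
}

// Loads puppeteer-core lazily so the helper above stays usable without it.
async function launchBrowser() {
  const puppeteer = require('puppeteer-core');
  return puppeteer.launch(resolveLaunchOptions());
}
```

On macOS, for instance, the server would be started with `CHROMIUM_PATH` pointing at the Chromium binary inside the app bundle.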

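# Installing Puppeteer on CentOS

When installing Puppeteer on CentOS, some of the shared libraries Chromium needs are missing. Try installing them with the following command: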

yum -y install libX11 libXcomposite libXcursor libXdamage libXext libXi libXtst cups-libs libXScrnSaver libXrandr alsa-lib pango atk at-spi2-atk gtk3

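# Using Puppeteer in Docker

The Chromium that Puppeteer installs needs shared libraries that are missing from the Node base images for Docker, such as node-alpine or node-slim. To use Puppeteer in Docker you therefore have to install those dependencies first. See the official Puppeteer recommendations for Docker.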


Getting headless Chrome up and running in Docker can be tricky. The bundled Chromium that Puppeteer installs is missing the necessary shared library dependencies.

To fix, you'll need to install the missing dependencies and the latest Chromium package in your Dockerfile, as in the Dockerfiles later in this section.

# Launch notes

Setting the following environment variable makes Puppeteer skip the download of its bundled Chromium.

ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD true

The program has to pass the following arguments when it launches Puppeteer:

const browser =  await puppeteer.launch({
    //executablePath: '/usr/bin/chromium-browser',
    args: ['--disable-dev-shm-usage', '--no-sandbox']
});
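  • executablePath points to the Chromium binary installed from the Alpine package. It is commented out above; enable it when you use the system Chromium instead of the bundled one.
  • --disable-dev-shm-usage works around /dev/shm (shared memory) in Docker being too small for Chromium to run; see the Tips section of the Puppeteer troubleshooting guide.
  • --no-sandbox avoids startup failures caused by the Chromium sandbox on the Linux kernel. From a security standpoint Chrome should not run as root; if you really have to run it as root, it must be started with --no-sandbox.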
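Since --no-sandbox weakens isolation, it may be worth passing it only when the process really runs as root. A minimal sketch of that rule follows; the function name `chromiumArgs` is hypothetical, and the uid check is an assumption about how you want to decide (a non-root user in a restricted container may still need the flag).

```javascript
// Sketch: add --no-sandbox only for root (uid 0).
// process.getuid does not exist on Windows, hence the typeof guard.
function chromiumArgs(
  uid = typeof process.getuid === 'function' ? process.getuid() : -1
) {
  const args = ['--disable-dev-shm-usage'];
  if (uid === 0) {
    args.push('--no-sandbox');
  }
  return args;
}
```

The result is meant to be used as `puppeteer.launch({ args: chromiumArgs() })`.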

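# Dependencies on Ubuntu

Scenario: a Node.js server that generates PDFs, packaged as a Docker image. Installing puppeteer alone is not enough; PDF generation fails until the system libraries below are installed.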

RUN apt-get update && \
    apt-get -y install xvfb gconf-service libasound2 libatk1.0-0 libc6 libcairo2 libcups2 \
      libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 \
      libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 \
      libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 \
      libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils wget && \
    rm -rf /var/lib/apt/lists/*

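Puppeteer's page.pdf() uses Chromium's built-in PDF printing and does not need pdftk. pdftk is only needed if the generated PDFs are post-processed afterwards (for example merged or split); in that case install it the way the docker-pdftk image does: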

RUN apt-get update \
    && apt-get install -y pdftk mc\
    && apt-get clean autoclean \
    && apt-get autoremove --yes \
    && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* \
    && mkdir /input \
    && mkdir /output 

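Time zone handling: by default the system time in the container is UTC, which is 8 hours behind Beijing time.

ENV TZ=Asia/Shanghai
# Requires the tzdata package, which provides /usr/share/zoneinfo
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone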
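To check from Node.js that the time zone setting took effect, something like the following can be run inside the container. It is only a sketch: `timeIn` is a hypothetical helper, and the sample instant is arbitrary.

```javascript
// Hypothetical helper: formats an instant as HH:mm in a given IANA time zone.
function timeIn(timeZone, date) {
  return new Intl.DateTimeFormat('en-GB', {
    timeZone,
    hour: '2-digit',
    minute: '2-digit',
    hourCycle: 'h23',
  }).format(date);
}

// The same instant shown in UTC and in Beijing time (8 hours apart).
const instant = new Date(Date.UTC(2022, 3, 15, 1, 30)); // 2022-04-15T01:30Z
console.log('UTC:          ', timeIn('UTC', instant));
console.log('Asia/Shanghai:', timeIn('Asia/Shanghai', instant));

// The zone the Node.js process itself uses; it follows the TZ variable.
console.log('process zone: ', Intl.DateTimeFormat().resolvedOptions().timeZone);
```

If the last line still reports UTC after the image is rebuilt, the TZ variable is probably not reaching the Node.js process.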

Alternatively, follow the setup from the official documentation and install google-chrome-stable directly:

FROM node:12-slim

# Install latest chrome dev package and fonts to support major charsets (Chinese, Japanese, Arabic, Hebrew, Thai and a few others)
# Note: this installs the necessary libs to make the bundled version of Chromium that Puppeteer
# installs, work.
RUN apt-get update \
    && apt-get install -y wget gnupg \
    && wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
    && sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list' \
    && apt-get update \
    && apt-get install -y google-chrome-stable fonts-ipafont-gothic fonts-wqy-zenhei fonts-thai-tlwg fonts-kacst fonts-freefont-ttf libxss1 \
      --no-install-recommends \
    && rm -rf /var/lib/apt/lists/*

# If running Docker >= 1.13.0 use docker run's --init arg to reap zombie processes, otherwise
# uncomment the following lines to have `dumb-init` as PID 1
# ADD https://github.com/Yelp/dumb-init/releases/download/v1.2.2/dumb-init_1.2.2_x86_64 /usr/local/bin/dumb-init
# RUN chmod +x /usr/local/bin/dumb-init
# ENTRYPOINT ["dumb-init", "--"]

# Uncomment to skip the chromium download when installing puppeteer. If you do,
# you'll need to launch puppeteer with:
#     browser.launch({executablePath: 'google-chrome-stable'})
# ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD true

# Install puppeteer so it's available in the container.
RUN npm init -y &&  \
    npm i puppeteer \
    # Add user so we don't need --no-sandbox.
    # same layer as npm install to keep re-chowned files from using up several hundred MBs more space
    && groupadd -r pptruser && useradd -r -g pptruser -G audio,video pptruser \
    && mkdir -p /home/pptruser/Downloads \
    && chown -R pptruser:pptruser /home/pptruser \
    && chown -R pptruser:pptruser /node_modules \
    && chown -R pptruser:pptruser /package.json \
    && chown -R pptruser:pptruser /package-lock.json

# Run everything after as non-privileged user.
USER pptruser

CMD ["google-chrome-stable"]

Build and run:

docker build -t puppeteer-chrome-linux .

docker run -i --init --rm --cap-add=SYS_ADMIN \
   --name puppeteer-chrome puppeteer-chrome-linux \
   node -e "`cat yourscript.js`"

# Dependencies for an in-house CentOS image (recommended)

# The image

Install the dependencies:

On a barebones install of CentOS 7 (on Amazon AWS EC2), I was able to get chrome headless running with the following:

sudo yum install -y atk java-atk-wrapper at-spi2-atk gtk3 libXt

A minimal smoke test

const puppeteer = require('puppeteer')

// Same PDF example as before, with the launch arguments needed inside the container.
async function download() {
  const browser = await puppeteer.launch({
    args: ['--disable-dev-shm-usage', '--no-sandbox']
  })
  const page = await browser.newPage()
  // await page.goto('http://10.45.xx.xx:9001/report.html', { waitUntil: 'networkidle0'})
  await page.goto('https://www.baidu.com', { waitUntil: 'networkidle0' })
  const pdf = await page.pdf({ path: './test.pdf', format: 'A4' })
  await browser.close()
  return pdf
}
download().catch(console.error)

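Installing puppeteer

  • Normally it is installed together with all other dependencies in the project directory (the one containing package.json), i.e. by running: npm install
  • For a quick test you can install it directly in the current directory: npm install puppeteer
  • After installation, the following command shows which shared libraries the chrome executable is still missing (the version number in the path may differ): ldd node_modules/puppeteer/.local-chromium/linux-706915/chrome-linux/chrome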
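The ldd check from the last bullet can also be scripted, for example to fail an image build early. The sketch below assumes the usual ldd output format, in which an unresolved library is printed as `name => not found`; the function names are hypothetical.

```javascript
const { execFileSync } = require('child_process');

// Returns the names of the shared libraries that ldd reports as "not found".
function parseMissingLibs(lddOutput) {
  return lddOutput
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.endsWith('=> not found'))
    .map((line) => line.split(' => ')[0]);
}

// Runs ldd against a Chromium binary (Linux only) and lists what is missing.
function missingLibsFor(chromePath) {
  const output = execFileSync('ldd', [chromePath], { encoding: 'utf8' });
  return parseMissingLibs(output);
}
```

Calling `missingLibsFor('node_modules/puppeteer/.local-chromium/linux-706915/chrome-linux/chrome')` should return an empty array once every dependency is installed.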

If the text in the exported PDF does not render (for example, Chinese characters are missing), install Chinese fonts:

# Chinese locale support
RUN localedef -c -f UTF-8 -i zh_CN zh_CN.utf8
# Install the dependency libraries
#apt-get install -y google-chrome-stable
RUN yum install -y kde-l10n-Chinese pango.x86_64 libXcomposite.x86_64 libXcursor.x86_64 libXdamage.x86_64 libXext.x86_64 libXi.x86_64 libXtst.x86_64 cups-libs.x86_64 libXScrnSaver.x86_64 libXrandr.x86_64 GConf2.x86_64 alsa-lib.x86_64 atk.x86_64 gtk3.x86_64 \
# Font packages; note the last two (WenQuanYi), which provide the Chinese glyphs
ipa-gothic-fonts xorg-x11-fonts-100dpi xorg-x11-fonts-75dpi xorg-x11-utils xorg-x11-fonts-cyrillic xorg-x11-fonts-Type1 xorg-x11-fonts-misc wqy-unibit-fonts.noarch wqy-zenhei-fonts.noarch

Install the npm dependencies again:

npm config set registry https://registry.npm.taobao.org/
npm config set puppeteer_download_host=https://npm.taobao.org/mirrors
#npm install puppeteer # if Node.js was installed from the RPM package, this installs directly
npm install --unsafe-perm=true --allow-root

Problems you may run into:

Case 1: permission denied while the installer creates the .local-chromium directory. Fix: npm install --unsafe-perm=true --allow-root

ERROR: Failed to set up Chromium r970485! Set "PUPPETEER_SKIP_DOWNLOAD" env variable to skip download.
{ [Error: EACCES: permission denied, mkdir '/srv/app/pdf-render-puppeteer/node_modules/puppeteer/.local-chromium']
  errno: -13,
  code: 'EACCES',
  syscall: 'mkdir',
  path:
   '/srv/app/pdf-render-puppeteer/node_modules/puppeteer/.local-chromium' }
npm WARN pdf-render-puppeteer@1.0.0 No repository field.

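Case 2: Chromium fails to launch because a shared library is missing. Fix: install the system dependencies listed above.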

Fail to initialze renderer. Error: Failed to launch the browser process!
/srv/app/pdf-render-puppeteer/node_modules/puppeteer/.local-chromium/linux-970485/chrome-linux/chrome: error while loading shared libraries: libatk-1.0.so.0: cannot open shared object file: No such file or directory

TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md
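# Node.js with Express

Code:

// This excerpt assumes that `app` is an Express application, that `renderer` is an
// object exposing pdf(url, options) and html(url, options), and that
// `contentDisposition` comes from the content-disposition package.
app.use(async (req, res, next) => {
  let { url, type, filename, ...options } = req.query
  if (!url) {
    return res.status(400).send('Search with url parameter. For example, ?url=http://yourdomain')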
  }
  if (!url.includes('://')) {
    url = `http://${url}`
  }
  try {
    switch (type) {
      case 'pdf':
        const urlObj = new URL(url)
        if (!filename) {
          filename = urlObj.hostname
          if (urlObj.pathname !== '/') {
            filename = urlObj.pathname.split('/').pop()
            if (filename === '') filename = urlObj.pathname.replace(/\//g, '')
            const extDotPosition = filename.lastIndexOf('.')
            if (extDotPosition > 0) filename = filename.substring(0, extDotPosition)
          }
        }
        if (!filename.toLowerCase().endsWith('.pdf')) {
          filename += '.pdf'
        }
        const { contentDispositionType, ...pdfOptions } = options
        const pdf = await renderer.pdf(url, pdfOptions)
        res
          .set({
            'Content-Type': 'application/pdf',
            'Content-Length': pdf.length,
            'Content-Disposition': contentDisposition(filename, {
              type: contentDispositionType || 'attachment',
            }),
          })
          .send(pdf)
        break
      default:
        const html = await renderer.html(url, options)
        res.status(200).send(html)
    }
  } catch (e) {
    next(e)
  }
})
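The filename rules buried in the `pdf` branch are easier to follow, and to test, as a standalone function. The sketch below restates the same logic; the function name `derivePdfFilename` is hypothetical and not part of the handler above.

```javascript
// Same rules as the handler: use the explicit filename if given, otherwise the
// last path segment without its extension, otherwise the host name; the result
// always ends in ".pdf".
function derivePdfFilename(rawUrl, filename) {
  const urlObj = new URL(rawUrl.includes('://') ? rawUrl : `http://${rawUrl}`);
  let name = filename;
  if (!name) {
    name = urlObj.hostname;
    if (urlObj.pathname !== '/') {
      name = urlObj.pathname.split('/').pop();
      // A trailing slash leaves an empty last segment: fall back to the whole path.
      if (name === '') name = urlObj.pathname.replace(/\//g, '');
      const dot = name.lastIndexOf('.');
      if (dot > 0) name = name.substring(0, dot);
    }
  }
  return name.toLowerCase().endsWith('.pdf') ? name : `${name}.pdf`;
}

console.log(derivePdfFilename('http://example.com/report.html'));
console.log(derivePdfFilename('example.com'));
```

Keeping this logic in one function also makes it simple to sanitise the name in a single place before it is handed to the Content-Disposition header.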

Process management with pm2 and the start script

#!/bin/bash
PRO_NAME=pdf-server
#pm2 stop $PRO_NAME && pm2 del $PRO_NAME
#pm2 start app.js -i max -n $PRO_NAME
pm2 startOrRestart app.js -i max -n $PRO_NAME

To start from a configuration file instead:

deploy.json

{
  "apps" : [
    {
      "name"      : "pdf-server",
      "script"    : "./app.js",
      "max_memory_restart": "2G",
      "exec_mode": "cluster",
      "instances": -1,
      "env": {
        "COMMON_VARIABLE": "true",
        "PORT": 9527
      }
    }
  ]
}

The start script then becomes start.sh:

#!/bin/bash
pm2 startOrRestart deploy.json

Example request:

http://10.45.xxx.116:9527?url=http://10.45.46.xxxx:9001/report.html&type=pdf
# Node.js with Egg

Reference example

'use strict';
const Controller = require('egg').Controller;
const path = require('path');
const puppeteer = require('puppeteer');
const moment = require('moment');

class ReportController extends Controller {
  async index() {
    const { ctx } = this;
    // The file name prefix "体检报告" means "health check report"
    const pdfFileName = `体检报告_${moment().format('YYYYMMDDHHmm')}.pdf`;
    const pdfFilePath = path.join(__dirname, '../../temp/', pdfFileName);
    // Launch Puppeteer and load the page
    const browser = await puppeteer.launch();
    try {
      const page = await browser.newPage();
      await page.setViewport({ width: 1920, height: 1080 });
      // Open the locally served static page that should be rendered
      await page.goto('http://localhost:8080', { waitUntil: 'networkidle0' });
      // Generate the PDF
      await page.pdf({
        path: pdfFilePath,
        format: 'A4',
        scale: 1,
        printBackground: true,
        landscape: false,
        displayHeaderFooter: false
      });
    } finally {
      // Always close the browser, even if rendering failed
      await browser.close();
    }
    // Return the URL of the generated file
    ctx.status = 200;
    ctx.body = {
      url: `${ctx.request.protocol}://${ctx.request.host}/resource/${pdfFileName}`
    };
  }
}
module.exports = ReportController;
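The controller starts a new Chromium for every request, which is slow under load. One common alternative is to launch the browser once and open a fresh page per request. The sketch below shows only the sharing part; `createSharedBrowser` is a hypothetical helper, and the `launch` function is injected so the helper does not depend on Puppeteer directly.

```javascript
// Hypothetical helper: memoises the launch promise so that concurrent callers
// share one browser instead of each starting their own.
function createSharedBrowser(launch) {
  let browserPromise = null;
  return function getBrowser() {
    if (!browserPromise) {
      browserPromise = Promise.resolve()
        .then(launch)
        .catch((err) => {
          browserPromise = null; // allow a retry after a failed launch
          throw err;
        });
    }
    return browserPromise;
  };
}
```

In the controller this could be used as `const getBrowser = createSharedBrowser(() => puppeteer.launch())` at module level, followed by `const browser = await getBrowser()` inside `index()`, closing only the page (not the browser) when the request is done.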

# node-alpine base image

FROM node:9-alpine
RUN apk update && apk upgrade && \
    echo http://nl.alpinelinux.org/alpine/edge/community >> /etc/apk/repositories && \
    echo http://nl.alpinelinux.org/alpine/edge/main >> /etc/apk/repositories && \
    apk add --no-cache \
      zlib-dev \
      xvfb \
      xorg-server \
      dbus \
      ttf-freefont \
      chromium \
      nss \
      ca-certificates \
      dumb-init

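These commands install Chromium and the dependencies it requires. Installing Chromium here lets you start from a node-alpine or node-slim image, add Chromium and the other dependencies once, and publish the result as a new image that serves as the Docker base for your projects. This greatly reduces docker build time.

# Ready-made third-party image: puppeteer-renderer

# Reference configuration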
FROM weihanli/puppeteer:latest
COPY --chown=pptruser:pptruser ./src .
EXPOSE 3000

Or, building on the zenato/puppeteer image:

FROM zenato/puppeteer
USER root
COPY . /app
RUN cd /app && npm install --quiet
EXPOSE 3000
WORKDIR /app
CMD npm run start
# 1: Getting Started

Install dependencies:

npm install

Start the server (if you can run Chromium; service port: 3000):

npm start

Start server using docker (If you can not run Chromium and installed docker)

docker run -d --name ppt-render -p 9876:3000 zenato/puppeteer-renderer

http://10.45.xxx.116:9876?url=http://10.45.xxx.116:9001/report.html&type=pdf
http://10.45.xxx.116:9876?url=http://10.45.xxx.116:9001/report.html&type=screenshot
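The URLs above work because the target address has no query string of its own. If it does, the target should be percent-encoded, otherwise its parameters would be read as options for the renderer. A small sketch using the standard URL API follows; `buildRenderUrl` is a hypothetical helper, and the host names in the usage line are placeholders.

```javascript
// Builds a request URL for the renderer. URLSearchParams percent-encodes the
// target, so its own "?" and "&" are not mistaken for renderer options.
function buildRenderUrl(rendererBase, targetUrl, type) {
  const requestUrl = new URL(rendererBase);
  requestUrl.searchParams.set('url', targetUrl);
  if (type) {
    requestUrl.searchParams.set('type', type);
  }
  return requestUrl.toString();
}

console.log(buildRenderUrl('http://localhost:9876', 'http://example.com/report.html?id=1&lang=zh', 'pdf'));
```

On the server side Express decodes the `url` query parameter again, so the handler receives the original target address.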

Test on your browser

Input url http://localhost:{port}/?url=https://www.google.com

If you can see html code, server works fine.

# 2: Integration with an existing service

If you already have a running service, set up the proxy configuration with middleware. See puppeteer-renderer-middleware for Express.

const express = require('express');
const renderer = require('puppeteer-renderer-middleware');
const app = express();
app.use(renderer({
  url: 'http://installed-your-puppeteer-renderer-url',
  // userAgentPattern: /My-Custom-Agent/i,
  // excludeUrlPattern: /.*\.html$/i,
  // timeout: 30 * 1000,
}));
// your service logic...
app.listen(8080);

# Exporting PDF

# Taking screenshots

Last updated: 2022/04/15, 05:41:27