Crawler + GPT for Multi-Page Data Analysis [A Productivity Tool]

2024-03-13 13:09:30 Big Data


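If you often read technical documentation on the web, you will have noticed that thorough, professional documentation is usually spread across many related pages. So how can you use GPT to analyze content like that?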

Take the NVIDIA Triton Inference Server documentation [https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html] as an example.

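There are generally two ways to do this:

1. Give each page link to GPT or GPTs separately.
2. Crawl the content of all the pages, then give the generated document to GPT or GPTs.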

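Method 1 is simple in principle but tedious in practice, hard to standardize, and makes it harder for GPT to analyze and understand the content as a whole.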

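Method 2 adds a crawling step, but it lends itself to a standardized, one-click workflow, and the generated document is easy for GPT to process. You can also upload the document as a knowledge base, either by building your own GPTs or by using someone else's.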

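If you go with the first method, here is a very simple script that extracts the links of the related pages. I still recommend the second method: once the document content has been extracted, you can take full advantage of GPT to analyze it.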

Method 1: link-extraction script

import re
from urllib.parse import urljoin

import requests

# Match href values containing "reference-arch"; adjust the pattern to your docs.
href = re.compile('href="(.*?reference-arch.*?)"')
base_url = 'https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html'

# Fetch the index page itself; '.../docs/**' is a glob pattern, not a fetchable URL.
rsp = requests.get(base_url)
# print(rsp.text)
hrefs = href.findall(rsp.text)

def pad_base(href: str):
    # Absolute links pass through unchanged; relative links are resolved
    # against the index page instead of being appended to 'index.html'.
    return urljoin(base_url, href)

hrefs = [pad_base(href) for href in hrefs]
hrefs = sorted(set(hrefs))  # de-duplicate and sort
print(len(hrefs))
print('\n'.join(hrefs))
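One detail of the script above that is worth seeing in isolation is how relative links become full URLs. The following is a minimal sketch only: it runs on a small made-up HTML snippet rather than the live page, and `extract_links` and the sample links are hypothetical names introduced for illustration. It relies on the standard library's `urllib.parse.urljoin`, which resolves a relative href against the page it was found on.

```python
import re
from urllib.parse import urljoin

# Hypothetical helper for illustration; not part of the original script.
def extract_links(html: str, page_url: str, keyword: str) -> list[str]:
    """Return sorted, de-duplicated absolute URLs whose href contains keyword."""
    pattern = re.compile(r'href="([^"]*' + re.escape(keyword) + r'[^"]*)"')
    # urljoin leaves absolute links unchanged and resolves relative ones
    # against the directory of page_url.
    return sorted({urljoin(page_url, h) for h in pattern.findall(html)})

# Made-up HTML standing in for the downloaded index page.
sample_html = '''
<a href="user_guide/architecture.html">Architecture</a>
<a href="https://example.com/architecture-notes">External notes</a>
<a href="user_guide/architecture.html">Duplicate link</a>
<a href="getting_started/quickstart.html">Quickstart</a>
'''
page_url = "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html"

links = extract_links(sample_html, page_url, "architecture")
print("\n".join(links))
```

Simply concatenating the page URL and the href would produce addresses ending in `index.htmluser_guide/...`, which is why the sketch resolves links with `urljoin` instead.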

Method 2

a. Use a crawler to scrape the content and save it as a JSON document

For the crawling itself, see this project on GitHub:

git clone https://github.com/builderio/gpt-crawler

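I recommend using the Docker setup to fetch the documentation content.

Change into the containerapp directory of the cloned repository and run ./run.sh directly: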

root@yyy:/gpt-crawler/containerapp# ll
total 24
drwxr-xr-x 3 root root 4096 Feb  2 18:05 ./
drwxr-xr-x 7 root root 4096 Feb  2 17:53 ../
drwxr-xr-x 2 root root 4096 Feb  3 08:55 data/
-rw-r--r-- 1 root root 1504 Feb  2 16:44 Dockerfile
-rw-r--r-- 1 root root  318 Feb  2 16:44 README.md
-rwxr-xr-x 1 root root  665 Feb  2 16:44 run.sh*
root@yyy:/gpt-crawler/containerapp# ./run.sh

run.sh builds the Docker image and then runs the crawl command:

#!/bin/bash

# Check if there is a Docker image named "crawler"
if ! sudo docker images | grep -w 'crawler' > /dev/null; then
    echo "Docker repository 'crawler' not found. Building the image..."
    # Build the Docker image with the name 'crawler'
    sudo docker build -t crawler .
else
    echo "Docker image already built."
fi

# Ensure that init.sh script is executable
sudo chmod +x ./data/init.sh

# Starting docker, mount docker.sock to work with docker-in-docker function, mount data directory for input/output from container
sudo docker run --rm -it -v /var/run/docker.sock:/var/run/docker.sock -v ./data:/home/data crawler bash -c "/home/data/init.sh"

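The crawl then runs, and when it finishes it generates an output file.

The configuration file is at the following path; adjust it to fit your own situation.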

root@yyy:/gpt-crawler/containerapp/data# cat config.ts
import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html",
  match: "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/**",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
  maxTokens: 2000000,
};
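To get a feel for which pages the `match` setting admits, the pattern can be approximated with Python's standard `fnmatch` module. This is only a rough stand-in: the crawler applies its own glob matching, which may differ in edge cases, and the candidate URLs below are made up for illustration.

```python
from fnmatch import fnmatchcase

# Pattern copied from the config above.
match = "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/**"

# Hypothetical candidate links a crawler might find on a page.
candidates = [
    "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/quickstart.html",
    "https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/index.html",
    "https://github.com/triton-inference-server/server",
]

# In fnmatch, "*" matches any run of characters (including "/"),
# so the trailing "**" accepts everything under the docs/ prefix.
in_scope = [u for u in candidates if fnmatchcase(u, match)]
print(in_scope)
```

Only the first candidate falls under the docs/ prefix, so links that lead away from the user guide would be left out of the crawl under this approximation.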

b. Upload the JSON document to GPT for analysis.
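Before uploading, it can help to inspect the crawl output and, if needed, flatten it into plain text. The sketch below assumes that output.json is a JSON array of objects with `title`, `url`, and `html` fields; that is an assumption about the crawler's output format, so check it against your own file. `pages_to_text` is a hypothetical helper, and the sample data is made up so that the example runs without a real crawl.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical helper; assumes each record has "title", "url" and "html" keys.
def pages_to_text(json_path: Path) -> str:
    """Merge crawled pages into one text document, one section per page."""
    pages = json.loads(json_path.read_text(encoding="utf-8"))
    sections = []
    for page in pages:
        title = page.get("title", "")
        url = page.get("url", "")
        body = page.get("html", "").strip()
        sections.append(f"# {title}\nSource: {url}\n\n{body}")
    return "\n\n".join(sections)

# Made-up sample standing in for a real output.json.
sample = [
    {"title": "Quickstart", "url": "https://example.com/quickstart", "html": "Run the server."},
    {"title": "Architecture", "url": "https://example.com/architecture", "html": "Models are served."},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "output.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    merged = pages_to_text(path)

print(merged)
print(f"{len(sample)} pages, {len(merged)} characters")
```

For a real crawl, point `pages_to_text` at the generated output file and upload either the resulting text or the JSON itself.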

I will skip the detailed steps here. The next post will cover some more advanced analysis methods. That is all for today.


人工智能物联网_17aiot.com
