Elinks/Lynx With Python: How To Speed Up Headless Website Browsing/Parsing/Scraping With Cookies

speeding up website browsing
website parsing and scraping
elinks/lynx
Python
open-source tools
NewsCrawl
General News Extractor
This article explores ways to improve the speed of website browsing, parsing, and scraping using elinks/lynx and Python. It introduces open-source tools like NewsCrawl for sentiment analysis and General News Extractor for news content extraction. Additionally, it covers customizing headless puppeteer/phantomjs and utilizing readability and jsdom to enhance the process.
Published

September 12, 2022


newscrawl 狠心开源企业级舆情新闻爬虫项目:支持任意数量爬虫一键运行、爬虫定时任务、爬虫批量删除;爬虫一键部署;爬虫监控可视化; 配置集群爬虫分配策略;👉 现成的docker一键部署文档已为大家踩坑

general news extractor for extracting main content of news, articles

pip3 install gne

first of all, set it up with a normal user agent

even better, we can chain it with some customized headless puppeteer/phantomjs (do not load video data), dump the dom when ready, and use elinks/lynx to analyze the dom tree.

to test if the recommendation bar shows up:

https://v.qq.com/x/page/m0847y71q98.html

to make web page more readable:

https://github.com/luin/readability

load webpage headlessly:

https://github.com/jsdom/jsdom

https://github.com/ryanpetrello/python-zombie