spyder

2020-01-29


jsoup api

Learning web crawling

The following seven packages provide the classes and interfaces for developing jsoup applications.

org.jsoup
org.jsoup.examples
org.jsoup.helper
org.jsoup.nodes
org.jsoup.parser
org.jsoup.safety
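org.jsoup.select

Main classes:

Jsoup: entry-point class with methods for connecting to URLs and for cleaning and parsing HTML documents
Document: represents a loaded HTML document
Element: reads and manipulates an HTML element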

Three ways to load HTML:

import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.junit.Test; // JUnit 4 assumed; use org.junit.jupiter.api.Test for JUnit 5

@Test
public void test1() throws IOException {
    // Load HTML from a URL
    Document document = Jsoup.connect("http://www.baidu.com").get();
    // Read the page title
    String title = document.title();
    System.out.println("title :"+title);

    // Load HTML from a string
    String html = "<html><head><title>First parse</title></head>"
            + "<body><p>Parsed HTML into a doc.</p></body></html>";
    Document doc = Jsoup.parse(html);
    title = doc.title();
    System.out.println("title :"+title);

    // Load HTML from a file
    doc = Jsoup.parse(new File("F:\\jsoup\\html\\index.html"),"utf-8");
    title = doc.title();
    System.out.println("title :"+title);
}

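Getting the head, body, and links of an HTML page:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test; // JUnit 4 assumed

@Test
public void test2() throws IOException {
    Document document = Jsoup.connect("http://www.baidu.com").get();
    String title = document.title();

    System.out.println("title :"+title);
    // Get the <head> of the page
    System.out.println(document.head());
    // Get the <body> of the page
    //System.out.println(document.body());

    // Get every link on the page (all <a> elements that have an href attribute)
    Elements links = document.select("a[href]");
    for (Element link : links){
        System.out.println("link : "+ link.attr("href"));
        System.out.println("text :"+ link.text());
    }
}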

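Getting a page's meta information

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.junit.Test; // JUnit 4 assumed

@Test
public void test3() throws IOException {
    Document document = Jsoup.connect("https://passport.lagou.com").get();

    System.out.println(document.head());
    // Read the page's meta tags.
    // Note: both lookups below throw if the page has no matching meta tag
    // (get(0) throws IndexOutOfBoundsException; first() returns null).
    String description = document.select("meta[name=description]").get(0).attr("content");
    System.out.println("Meta description : " + description);

    String keywords = document.select("meta[name=keywords]").first().attr("content");
    System.out.println("Meta keyword : " + keywords);
}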


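Getting a form by class name

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test; // JUnit 4 assumed

@Test
public void test4() throws IOException {
    Document document = Jsoup.connect("https://passport.lagou.com/login/login.html?signature=8ECBCDF2B86061432B425A0B94FC863B&service=https%253A%252F%252Fwww.lagou.com%252F&action=login&serviceId=lagou&ts=1547711303033").get();
    // Get the body of the Lagou login page
    //System.out.println(document.body());
    // Get the form elements by class name
    Elements formElement = document.getElementsByClass("form_body");
    System.out.println(formElement.html());
    // For each matched element, print the placeholder of its first <input> that has one
    for (Element inputElement : formElement) {
        String placeholder = inputElement.getElementsByTag("input").attr("placeholder");
        System.out.println(placeholder);
    }
}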

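Design idea

Create a separate package that implements a complete, general-purpose interface, and have Spring call the crawler on a schedule to fetch and clean the data.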
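
The scheduled trigger described above might look roughly like the following sketch. It uses the JDK's ScheduledExecutorService as a stand-in for Spring's scheduling support so that it runs without any framework. The names CrawlJob, runOnce, and CrawlScheduler, and the 50 ms interval, are hypothetical and chosen only for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical job: in the real design this would call the crawler and clean the data.
class CrawlJob {
    private final AtomicInteger runs = new AtomicInteger();

    void runOnce() {
        runs.incrementAndGet(); // placeholder for "crawl + clean"
    }

    int runCount() {
        return runs.get();
    }
}

class CrawlScheduler {
    // Runs the job at a fixed rate until it has executed `times` times, then stops.
    static void runRepeatedly(CrawlJob job, int times, long periodMillis) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch remaining = new CountDownLatch(times);
        executor.scheduleAtFixedRate(() -> {
            if (remaining.getCount() > 0) {
                job.runOnce();
                remaining.countDown();
            }
        }, 0, periodMillis, TimeUnit.MILLISECONDS);
        try {
            remaining.await(); // block until the requested number of runs has happened
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        CrawlJob job = new CrawlJob();
        runRepeatedly(job, 3, 50);
        System.out.println("crawl runs: " + job.runCount());
    }
}
```

In a Spring application the same job would more likely be triggered by a method annotated with @Scheduled, with the interval or cron expression taken from configuration.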

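Requirements

Static learning resources (low priority)
Daily-updated learning resources: popular comments, trending bugs, or trending knowledge points, which can be organized by category (high priority)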

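Interface design

interface message_spyder // Interface for crawling and cleaning data (interacts with external sites)

// Open question: support single-keyword search, multi-keyword search, or both?

[CSDN, GitHub, JianShu, Cnblogs (博客园, has a verification step), Runoob Notes (菜鸟笔记), …] × [Q&A, knowledge points, bugs]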
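
One way to model the source-by-category grid above is a pair of enums whose combinations are enumerated up front. This is only a sketch: the names Source, Category, CrawlTargets, and allTargets are hypothetical, and the enum constants simply mirror the sites named above (the open-ended "…" in the list is not filled in).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical enum of crawl sources, mirroring the sites listed in the design notes.
enum Source { CSDN, GITHUB, JIANSHU, CNBLOGS, RUNOOB_NOTES }

// Hypothetical enum of content categories.
enum Category { QA, KNOWLEDGE_POINT, BUG }

class CrawlTargets {
    // Builds every (source, category) pair as a label such as "CSDN/QA".
    static List<String> allTargets() {
        List<String> targets = new ArrayList<>();
        for (Source source : Source.values()) {
            for (Category category : Category.values()) {
                targets.add(source + "/" + category);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        for (String target : allTargets()) {
            System.out.println(target);
        }
    }
}
```

Enumerating the pairs this way makes it easy to see which source and category combinations still lack a crawler method.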

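String spyder_csdn(String key); // key is the search keyword; crawls CSDN (e.g. https://bbs.csdn.net/topics/395579435) for the latest questions in a topic

class="search-list J_search"

a href: link to the individual question

String spyder_github(String key); // GitHub crawler method
String spyder_jianShu(String key); // JianShu crawler method
String link_still_live(); // Links may go dead, so a liveness check is needed; returns the dead links as a String
String set_value_topic(); // Manually mark a topic or article as featured and keep it permanently
String search_url(String url, String key); // Fuzzy search through a user-supplied URL (hard to implement)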
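
The dead-link check (link_still_live) could be sketched as follows. To keep the example runnable offline, the HTTP call is hidden behind a hypothetical StatusFetcher interface; a real implementation might back it with java.net.HttpURLConnection or with jsoup. Treating any status outside 200-399, or any exception, as "dead" is an assumption, and the sketch returns a List<String> rather than the single String in the design notes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical abstraction over "fetch the HTTP status code of this URL".
interface StatusFetcher {
    int fetchStatus(String url) throws Exception;
}

class LinkChecker {
    // Returns the subset of urls that appear to be dead.
    static List<String> findDeadLinks(List<String> urls, StatusFetcher fetcher) {
        List<String> dead = new ArrayList<>();
        for (String url : urls) {
            try {
                int status = fetcher.fetchStatus(url);
                if (status < 200 || status >= 400) {
                    dead.add(url); // assumption: only 2xx and 3xx count as alive
                }
            } catch (Exception e) {
                dead.add(url); // assumption: unreachable links count as dead
            }
        }
        return dead;
    }

    public static void main(String[] args) {
        // Fake status table standing in for real network calls.
        Map<String, Integer> fakeStatuses = Map.of(
                "https://example.com/ok", 200,
                "https://example.com/gone", 404);
        StatusFetcher fetcher = url -> {
            Integer status = fakeStatuses.get(url);
            if (status == null) {
                throw new IllegalStateException("unreachable: " + url);
            }
            return status;
        };
        List<String> dead = findDeadLinks(
                List.of("https://example.com/ok",
                        "https://example.com/gone",
                        "https://example.com/missing"),
                fetcher);
        System.out.println(dead);
    }
}
```

Passing the fetcher in as a parameter keeps the check testable without network access, which matters if the job runs on a schedule.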

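interface message_manager // Interface for storing and updating data (interacts with the database)

String get_abstract(); // Sub-link topic, content summary, the sub-link itself, and its freshness
String get_contents(); // Sub-link content and full title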
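
The fields listed for get_abstract suggest a small value type per crawled sub-link. A minimal sketch follows; the class name CrawledItem, its field names, and the use of an expiry timestamp to represent "freshness" are all assumptions made for illustration.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical value type holding the abstract of one crawled sub-link.
final class CrawledItem {
    final String topic;      // sub-link topic
    final String summary;    // short summary of the content
    final String url;        // the sub-link itself
    final Instant expiresAt; // assumption: freshness is modeled as an expiry time

    CrawledItem(String topic, String summary, String url, Instant expiresAt) {
        this.topic = topic;
        this.summary = summary;
        this.url = url;
        this.expiresAt = expiresAt;
    }

    // An item is fresh while the given time is strictly before its expiry.
    boolean isFresh(Instant now) {
        return now.isBefore(expiresAt);
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        CrawledItem item = new CrawledItem(
                "jsoup selectors",
                "Selecting links with a[href]",
                "https://example.com/post/1",
                now.plus(Duration.ofDays(1)));
        System.out.println(item.topic + " is fresh: " + item.isFresh(now));
    }
}
```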

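Database design

Store mainly TEXT and URLs; images and other media are out of scope for now.
Content that can be cleaned successfully is stored as a String; content that cannot be cleaned is stored as its URL.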
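
The storage rule above (keep the cleaned text when cleaning succeeds, otherwise fall back to the URL) can be sketched as a small helper. StoredValue, fromCleaning, and the convention that a null or blank cleaning result means "cleaning failed" are assumptions for illustration only.

```java
// Hypothetical holder for what gets written to the database for one page.
final class StoredValue {
    final boolean isText; // true: value is cleaned text; false: value is the source URL
    final String value;

    private StoredValue(boolean isText, String value) {
        this.isText = isText;
        this.value = value;
    }

    // Assumption: a null or blank cleanedText means the page could not be cleaned.
    static StoredValue fromCleaning(String cleanedText, String url) {
        if (cleanedText != null && !cleanedText.trim().isEmpty()) {
            return new StoredValue(true, cleanedText.trim());
        }
        return new StoredValue(false, url);
    }

    public static void main(String[] args) {
        StoredValue cleaned = fromCleaning("  Parsed HTML into a doc.  ", "https://example.com/a");
        StoredValue fallback = fromCleaning("", "https://example.com/b");
        System.out.println(cleaned.isText + " -> " + cleaned.value);
        System.out.println(fallback.isText + " -> " + fallback.value);
    }
}
```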


Copyright： Copyright is owned by the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.