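Once you have learned regular expressions, you can use them to write a simple web crawler.
First, use network programming to load the web page into memory and save it locally.
Then use a regular expression to extract the useful information, and finally print it to the console.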
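The "save it locally" step can be sketched as follows. This is a minimal sketch, not part of the original program: the class name `PageSaver`, the methods `savePage` and `loadPage`, and the file name `page.html` are all hypothetical, and UTF-8 is assumed as the encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PageSaver {

    // Writes the page source to a local file as UTF-8 (assumed encoding),
    // creating the file or overwriting it if it already exists.
    static void savePage(String html, Path target) throws IOException {
        Files.write(target, html.getBytes(StandardCharsets.UTF_8));
    }

    // Reads a previously saved page back into memory.
    static String loadPage(Path source) throws IOException {
        return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path target = Paths.get("page.html"); // hypothetical file name
        // A fixed string stands in for the downloaded page source.
        savePage("<a href=\"http://dy.163.com/\">dy</a>", target);
        System.out.println(loadPage(target));
    }
}
```

Keeping a local copy means the regular expression can be tuned against the saved file without downloading the page again on every run.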
Crawling all the links on the NetEase (163.com) home page
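import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpiderTest {

    // Downloads the page at toUrl and returns its source as a single string.
    public static String getUrlContent(String toUrl) {
        StringBuilder sb = new StringBuilder();
        try {
            URL url = new URL(toUrl);
            // try-with-resources closes the reader automatically
            try (BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()))) {
                String temp;
                while ((temp = br.readLine()) != null) {
                    sb.append(temp);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String str = getUrlContent("https://www.163.com");
        //Pattern p = Pattern.compile(" "); // match the full content of each hyperlink
        // Non-greedy match of every href="..." attribute
        Pattern p2 = Pattern.compile("href=\".+?\"");
        //Pattern p2 = Pattern.compile("href=\"(.+?)\"");
        Matcher m = p2.matcher(str);
        while (m.find()) {
            System.out.println(m.group());
            //System.out.println(m.group(1));
        }
    }
}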
Output:
href="https://ent.163.com/19/0628/07/EIOA5VR000038FO9.html"
href="https://ent.163.com/19/0628/07/EIO7VG3U00038FO9.html"
href="http://fashion.163.com/"
href="http://lady.163.com/photoview/00A70026/115916.html#p=EIOGR4FS00A70026NOS"
href="http://lady.163.com/photoview/00A70026/115915.html#p=EIOGI7DD00A70026NOS"
href="http://dy.163.com/"
href="http://dy.163.com/v2/article/detail/EINGAP5J05259Q0E.html"
(many more follow...)
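The commented-out pattern in the program, `href=\"(.+?)\"`, wraps the URL in a capturing group, so `m.group(1)` returns only the URL without the surrounding `href="..."`. The following is a small sketch of that variant. The class name `HrefExtractor` and the method `extractLinks` are hypothetical, and a fixed HTML string stands in for the downloaded page so that the example runs without network access.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractor {

    // Group 1 captures the text between the quotes; +? is non-greedy,
    // so each match stops at the first closing quote.
    private static final Pattern HREF = Pattern.compile("href=\"(.+?)\"");

    // Returns the URL of every href attribute found in the given HTML.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // group(0) would include href="..." itself
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample input built from two of the links shown in the output above.
        String html = "<a href=\"http://fashion.163.com/\">fashion</a>"
                    + "<a href=\"http://dy.163.com/\">dy</a>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

Collecting the URLs into a list rather than printing them directly makes it easy to feed them back into `getUrlContent` if the crawler should later follow the links.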