Extracting specific data from specific HTML tags with Python

2024-04-02 04:09 • html • 999 reads

#!/usr/bin/env python
# Python 2 only: sgmllib was removed in Python 3.
from sgmllib import SGMLParser

s = """
<html>
<head>what's in</head>
<td> hello
<td> table1 blahblah </td>
<td> table </td>
</td>
ok the end blah
</html>
"""

class Parse(SGMLParser):
    def reset(self):
        # Number of <td> elements that are currently open.
        self.found_td = 0
        SGMLParser.reset(self)

    def start_td(self, attrs):
        # Called for every opening <td>.
        self.found_td += 1

    def end_td(self):
        # Called for every closing </td>.
        self.found_td -= 1

    def handle_data(self, text):
        # Print text only while inside at least one <td>.
        if self.found_td > 0:
            print 'Data: %s' % text

p = Parse()
p.feed(s)
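
The script above depends on sgmllib, which exists only in Python 2. As a rough sketch of the same counter technique on Python 3, the standard-library html.parser module can be used instead. The class name TdTextParser and the attributes td_depth and chunks are illustrative names, not part of the original code, and the text is collected into a list rather than printed:

```python
from html.parser import HTMLParser

SAMPLE = """
<html>
<head>what's in</head>
<td> hello
<td> table1 blahblah </td>
<td> table </td>
</td>
ok the end blah
</html>
"""


class TdTextParser(HTMLParser):
    """Collect text found inside <td> elements (illustrative name)."""

    def __init__(self):
        super().__init__()
        self.td_depth = 0   # number of <td> elements currently open
        self.chunks = []    # stripped text pieces seen inside <td>

    def handle_starttag(self, tag, attrs):
        # html.parser has one callback for all tags, so filter by name.
        if tag == "td":
            self.td_depth += 1

    def handle_endtag(self, tag):
        # Guard against a stray </td> driving the counter below zero.
        if tag == "td" and self.td_depth > 0:
            self.td_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.td_depth > 0 and text:
            self.chunks.append(text)


parser = TdTextParser()
parser.feed(SAMPLE)
parser.close()
print(parser.chunks)
```

A counter is used rather than a boolean because the sample nests one td inside another; a plain True/False flag would be switched off by the first inner closing tag.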

Keep one flag per tag, then check that flag in handle_data to decide whether to read the data.

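Suppose we want to process <title>Hello world!</title>.

When the parser reaches <title>, the title flag goes from 0 to 1. When it reaches data, it checks the flag: a value greater than 0 means the data belongs to the title and can be extracted. When it reaches </title>, the flag goes back from 1 to 0, so any data seen afterwards is recognized as not being part of the title.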
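
The flag-per-tag idea can be sketched for several tags at once. This is a minimal illustration on Python 3 using html.parser, not the original author's code: the class TagTextExtractor, its wanted, flags and found attributes, and the sample markup are all assumptions made for the example.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Keep one counter per wanted tag and collect the text inside it.

    Hypothetical helper written for illustration only.
    """

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted = set(wanted_tags)
        self.flags = defaultdict(int)    # tag name -> currently open count
        self.found = defaultdict(list)   # tag name -> collected text pieces

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.flags[tag] += 1         # e.g. title: 0 -> 1

    def handle_endtag(self, tag):
        if tag in self.wanted and self.flags[tag] > 0:
            self.flags[tag] -= 1         # e.g. title: 1 -> 0

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Attribute the text to every wanted tag that is open right now.
        for tag in self.wanted:
            if self.flags[tag] > 0:
                self.found[tag].append(text)


extractor = TagTextExtractor(["title", "h1"])
extractor.feed(
    "<html><head><title>Hello world!</title></head>"
    "<body><h1>Heading</h1><p>not collected</p></body></html>"
)
extractor.close()
print(extractor.found["title"])
print(extractor.found["h1"])
```

Text inside the p element is ignored here because no flag is raised for it, which is the same check the title walkthrough describes.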