Re: [linux_var] web data extractor

To: talking@ml.linuxvar.it
Subject: Re: [linux_var] web data extractor
From: Fernando Vezzosi <fv@linuxvar.it>
Date: Thu, 19 Jun 2008 16:27:11 +0200
Cc: fdg@voo.it
In-reply-to: <20080618211316.GA5121@voo.it>
List-archive: <http://ml.linuxvar.it/wws/arc/talking>
List-help: <mailto:sympa@ml.linuxvar.it?subject=help>
List-id: <talking.ml.linuxvar.it>
List-owner: <mailto:talking-request@ml.linuxvar.it>
List-post: <mailto:talking@ml.linuxvar.it>
List-subscribe: <mailto:sympa@ml.linuxvar.it?subject=subscribe%20talking>
List-unsubscribe: <mailto:sympa@ml.linuxvar.it?subject=unsubscribe%20talking>
Mail-followup-to: Fernando Vezzosi <fv@linuxvar.it>, talking@ml.linuxvar.it, fdg@voo.it
References: <20080618211316.GA5121@voo.it>
Reply-to: talking@ml.linuxvar.it
User-agent: Mutt/1.5.18 (2008-05-17)

On Wed, Jun 18, 2008 at 11:13:16PM +0200, Francesco De Gasperin wrote:
> I have a fairly long web page made up of dozens of paragraphs and
> subparagraphs, with headings and so on.

> Is there some program that simplifies the process? Or should I go
> straight to man perlre?

For parsing XML (and therefore HTML too, as long as it isn't too messy),
XPath is much better than convoluted regular expressions.

An example:

To get the titles of all the articles on the first page of reddit:

  wget -O - http://www.reddit.com/r/programming/ | xpath -e '/html/body/div[@class="content"]//p[@class="title"]/a/text()'

To get the addresses (URLs) of all the articles instead:

  wget -O - http://www.reddit.com/r/programming/ | xpath -e '/html/body/div[@class="content"]//p[@class="title"]/a/@href'

(note the // to descend through multiple levels)
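
The same idea can be sketched in Python with the standard library's xml.etree.ElementTree, which understands a subset of XPath. This is only an illustration under assumptions: the PAGE string is a made-up, well-formed stand-in for the downloaded page, and since ElementTree has no text() or @href steps, the text and the attribute are read from the matched elements instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample standing in for the downloaded page.
PAGE = """<html><body>
<div class="header"><p class="title"><a href="/about">About</a></p></div>
<div class="content">
  <p class="title"><a href="http://example.com/one">First article</a></p>
  <div class="nested">
    <p class="title"><a href="http://example.com/two">Second article</a></p>
  </div>
  <p class="tagline">submitted by someone</p>
</div>
</body></html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Counterpart of /html/body/div[@class="content"]//p[@class="title"]/a
# The // lets the match descend through any number of levels.
links = root.findall("./body/div[@class='content']//p[@class='title']/a")

titles = [a.text for a in links]       # stands in for .../a/text()
urls = [a.get("href") for a in links]  # stands in for .../a/@href

print(titles)
print(urls)
```

The link inside the "header" div is skipped because of the [@class='content'] predicate, while the one inside the nested div is still found thanks to the //.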

To quickly find the XPath path to the tags you care about, there are
two methods that I know of.

- use the Firefox developer toolbar, which has "View Style Information"
  (Ctrl+Shift+Y), and hover the mouse over the elements you want.
- use the html2 tool (or xml2, if the input is well-formed XML).

lothlorien:pts/0 15% dpkg -S =xpath =html2
libxml-xpath-perl: /usr/bin/xpath
xml2: /usr/bin/html2
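
To give an idea of why a flattening tool helps here, the following Python sketch turns a tree into one "path=value" line per attribute or text node, so you can search for a known piece of text and read the path off the same line. It is a rough imitation of the idea only, not of the exact output format of html2 or xml2; the flatten function and the SAMPLE string are hypothetical.

```python
import xml.etree.ElementTree as ET


def flatten(elem, prefix=""):
    """Yield 'path=value' lines, one per attribute or leading text node."""
    path = f"{prefix}/{elem.tag}"
    for name, value in elem.attrib.items():
        yield f"{path}/@{name}={value}"
    # Only the text directly inside the element is reported;
    # text that follows a child element (the "tail") is ignored here.
    if elem.text and elem.text.strip():
        yield f"{path}={elem.text.strip()}"
    for child in elem:
        yield from flatten(child, path)


# Hypothetical, well-formed sample document.
SAMPLE = (
    '<html><body><div class="content"><p class="title">'
    '<a href="http://example.com/one">First article</a>'
    "</p></div></body></html>"
)

lines = list(flatten(ET.fromstring(SAMPLE)))
for line in lines:
    print(line)
```

Searching the output for "First article" leads to the line starting with /html/body/div/p/a, which is already most of the XPath expression you need.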

It may take a while at first to get the logic, but in my opinion it is
an investment of time that is worth making :)

If the HTML is too messy there is the Python library Beautiful Soup,
which is said to parse the unparseable.  You do have to learn Python,
though :)
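
To show what a lenient parser buys you when the markup is too broken for an XML tool, here is a small sketch using Python's standard html.parser module. It is a stand-in for a dedicated library and does not use Beautiful Soup's API; the LinkCollector class and the BROKEN string are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for <a> tags, tolerating unclosed tags."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def _finish(self):
        if self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
        self._href, self._text = None, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._finish()  # a new <a> implicitly ends an unclosed one
            self._href = dict(attrs).get("href", "")

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            self._finish()

    def close(self):
        super().close()
        self._finish()  # flush a link left open at end of input


# Unquoted attributes, an unclosed <a>, unclosed <p> and <b> tags.
BROKEN = (
    '<p class=title><a href="/one">First article'
    "<p class=title><a href='/two'>Second <b>article</a>"
)

parser = LinkCollector()
parser.feed(BROKEN)
parser.close()
links = parser.links
print(links)
```

An XML parser would reject this input outright, whereas the event-based parser simply reports the tags it sees and leaves the recovery policy to you.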

Ciao,

-- 
  Fernando Vezzosi
	       3F29 4D20 510E E1AE 991D  3B12 D6BE 7C05 B289 97C9

Attachment: signature.asc
Description: Digital signature

Follow-Ups:
- Re: [linux_var] web data extractor
  - From: nextime <nextime@nexlab.it>
- Re: [linux_var] web data extractor
  - From: JohnnyRun <gianni79@gamebox.net>

References:
- [linux_var] web data extractor
  - From: Francesco De Gasperin <fdg@voo.it>
