Re: [linux_var] web data extractor

To: talking@ml.linuxvar.it
Subject: Re: [linux_var] web data extractor
From: JohnnyRun <gianni79@gamebox.net>
Date: Thu, 19 Jun 2008 18:34:49 +0200
In-reply-to: <20080619142711.GA2442@lothlorien.passione>

Fernando Vezzosi wrote:
> On Wed, Jun 18, 2008 at 11:13:16PM +0200, Francesco De Gasperin wrote:
>> I have a fairly long web page made up of dozens of paragraphs and
>> subparagraphs, with headings and so on.
> 
>> Is there a program that simplifies the process? Or should I go
>> straight to man perlre?
> 
> For parsing XML (and therefore HTML too, if it isn't too messy), it is
> much better to use XPath than convoluted regular expressions.
> 
> An example:
> 
> To grab the titles of all the articles on the front page of reddit:
> 
>   wget -O - http://www.reddit.com/r/programming/ | xpath -e '/html/body/div[@class="content"]//p[@class="title"]/a/text()'
> 
> To grab the URLs of all the articles instead:
> 
>   wget -O - http://www.reddit.com/r/programming/ | xpath -e '/html/body/div[@class="content"]//p[@class="title"]/a/@href'
> 
> (note the // for descending through multiple levels)

Hmm... but can he manage with just xpath in the shell?
The example is fairly simple, but with more deeply nested content, where
for instance a field may be missing, it looks like a mess to me.
I think it would be simpler to use XPath from Perl/Ruby/Python.
Is that what you're suggesting?
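
To make the missing-field problem concrete, here is a minimal sketch in
Python. It assumes the standard library's xml.etree.ElementTree, which
supports only a subset of XPath and needs well-formed XML; the sample
markup, the "entry" and "tagline" class names and the extract_entries
helper are all made up for illustration and are not the real reddit page.
The idea is to select each record first and then query its fields
relative to that record, so a missing field becomes None instead of
shifting the columns:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not the real reddit page).
# The second entry deliberately lacks the "tagline" field.
SAMPLE = """
<html><body><div class="content">
  <div class="entry">
    <p class="title"><a href="http://example.com/one">First article</a></p>
    <p class="tagline">submitted by alice</p>
  </div>
  <div class="entry">
    <p class="title"><a href="http://example.com/two">Second article</a></p>
  </div>
</div></body></html>
"""


def extract_entries(markup):
    """Return one dict per entry; fields absent from the markup become None."""
    root = ET.fromstring(markup)
    entries = []
    # Select each record first, then query its fields relative to it.
    for entry in root.findall(".//div[@class='entry']"):
        link = entry.find("p[@class='title']/a")
        tagline = entry.find("p[@class='tagline']")
        entries.append({
            "title": link.text if link is not None else None,
            "href": link.get("href") if link is not None else None,
            "tagline": tagline.text if tagline is not None else None,
        })
    return entries


if __name__ == "__main__":
    for row in extract_entries(SAMPLE):
        print(row)
```

With two separate shell pipelines (one for titles, one for taglines) the
two output lists would have different lengths and there would be no way
to tell which entry the gap belongs to. Real-world HTML is rarely
well-formed, so in practice a more tolerant parser would probably be
needed in place of ElementTree.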
Bye
JohnnyRun

