2010-01-19 21 views
2

Parsing Wikipedia markup with regular expressions in PHP

I have this bit of text that I want to remove from a page I fetched from Wikipedia:

{{Historical populations|type=USA 
| 1698|4937 
| 1712|5840 
| 1723|7248 
| 1737|10664 
| 1746|11717 
| 1756|13046 
| 1771|21863 
| 1790|33131 
| 1800|60515 
| 1810|96373 
| 1820|123706 
| 1830|202589 
| 1840|312710 
| 1850|515547 
| 1860|813669 
| 1870|942292 
| 1880|1206299 
| 1890|1515301 
| 1900|3437202 
| 1910|4766883 
| 1920|5620048 
| 1930|6930446 
| 1940|7454995 
| 1950|7891957 
| 1960|7781984 
| 1970|7894862 
| 1980|7071639 
| 1990|7322564 
| 2000|8008288 
| 2008*|8363710 
|footnote=Beginning 1900, figures are for consolidated city of five boroughs. Sources: 1698–1771,{{cite book|last=Greene and Harrington|first=|title=American Population Before the Federal Census of 1790|publisher=|location=New York|year=1932|isbn=|pages=}}, as cited in: {{cite book|last=Rosenwaike|first=Ira|title=Population History of New York City|publisher=Syracuse University Press|location=Syracuse, N.Y.|year=1972|isbn=0815621558|page=8}} 1790–1990,Gibson, Campbell.[http://www.census.gov/population/www/documentation/twps0027.html Population of the 100 Largest Cities and Other Urban Places in the United States:1790 to 1990], [[United States Census Bureau]], June 1998. Retrieved June 12, 2007. *2008 est[http://factfinder.census.gov/servlet/SAFFPopulation?_event=Search&geo_id=16000US3403940&_geoContext=01000US%7C04000US34%7C16000US3403940&_street=&_county=new+york+city&_cityTown=new+york+city&_state=04000US36&_zip=&_lang=en&_sse=on&ActiveGeoDiv=geoSelect&_useEV=&pctxt=fph&pgsl=160&_submenuId=population_0&ds_name=null&_ci_nbr=null&qr_name=null&reg=null%3Anull&_keyword=&_industry=Census Data for New York city, New York], [[United States Census Bureau]]. Retrieved June 12, 2007. 
}} 

I want to keep the following part as plain text as well (but without the parts wrapped in "{{" and "}}"):

New York is the most populous city in the United States, with an estimated 2008 population of 8,363,710(up from 7.3 million in 1990). This amounts to about 40.0% of New York State's population and a similar percentage of the metropolitan regional population. Over the last decade the city's population has been increasing and demographers estimate New York's population will reach between 9.2 and 9.5 million by 2030.{{cite web |title=New York City Population Projections by Age/Sex and Borough, 2000-2030 |publisher=[[New York City Department of City Planning]] |month=December | year=2006 |url=http://www.nyc.gov/html/dcp/pdf/census/projections_report.pdf |format=PDF |accessdate=2008-09-01}} See also {{cite news |last=Roberts, Sam |title=By 2025, Planners See a Million New Stories in the Crowded City |publisher=New York Times |date=February 19, 2006 |url=http://www.nytimes.com/2006/02/19/nyregion/19population.html?ex=1298005200&en=c586d38abbd16541&ei=5090&partner=rssuserland&emc=rss |accessdate=2008-09-01}} 

Thanks.

+0

Do you have an example of a regex you have already tried? – 2010-01-19 16:47:25

Answers

2

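The code I am currently using to clean up wiki pages is shown below. Take this page as an example:

http://en.wikipedia.org/wiki/Tel_Aviv (you can view the markup by clicking "edit this page")

I get this back:

“, and gave way to its reputation as the Mediterranean metropolis 'that never sleeps'. Haaretz editorial It is the country's financial capital and a major performing arts and business center. Tel Aviv's urban area is the second-largest city economy in the Middle East, and is ranked 42nd among global cities by Foreign Policys 2008 Global Cities Index. It is also the most expensive city in the region, and the 17th most expensive city in the world. The cost of living in Israel is high, with Tel Aviv being its most expensive city to live in. According to Mercer, a human resources consulting firm based in New York, as of 2008 Tel Aviv is the most expensive city in the Middle East and the 14th most expensive in the world. It falls just behind Singapore and Paris and just ahead of Sydney and Dublin in this respect. By comparison, New York City is 22nd”

This is incorrect. The expected result should be:

Tel Aviv-Yafo (Hebrew: תֵּל-אָבִיב-יָפוֹ; Arabic: تلأبيب, Tall'Abīb), usually called Tel Aviv, is the second-largest city in Israel, with an estimated population of 393,900. The city is situated on the Israeli Mediterranean coastline, with a land area of 51.8 square kilometres (20.0 sq mi). It is the largest and most populous city in the metropolitan area of Gush Dan, home to 3.15 million people as of 2008. The city is governed by the Tel Aviv-Yafo municipality, headed by Ron Huldai.

This is the PHP code: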

function clean_wiki_text($text) 
    { 
    // first get rid of UGC HTML tags 
    $text = strip_tags($text); 

    // keep convert tag 
    $text = preg_replace("/\{\{convert\|([^\|]+)\|([^\|]+)\|[^\}]+\}\}/", "$1$2", $text); 

    // remove large blocks (treat as tags) 
    $text = preg_replace("/(<![^>]+>)/", '', $text); 
    $text = preg_replace('/\{\{\s?/', '<', $text); 
    $text = str_replace('}}', ' />', $text); 

    $text = str_replace('<! />', '', $text); 

    // more wiki formatting 
    $text = preg_replace("/'{2,6}/", '', $text); 
    $text = preg_replace("/[=\s]+External [lL]inks[\s=]+/", '', $text); 
    $text = preg_replace("/[=\s]+See [aA]lso[\s=]+/", '', $text); 
    $text = preg_replace("/[=\s]+References[\s=]+/", '', $text); 
    $text = preg_replace("/[=\s]+Notes[\s=]+/", '', $text); 
    $text = preg_replace('/\{\{([^\}]+)\}\}/', '', $text); 

    // drop page link text 
    $text = preg_replace('/\[\[([^:\|\]]+)\|([^:\]]+)\]\]/', "$2", $text); 
    // or keep it with preg_replace('/\[\[([^:\|\]]+)\|([^:\]]+)\]\]/', "$1 ($2)", $text); 

    $text = preg_replace('/\(\[[^\]]+\]\)/', '', $text); 
    $text = preg_replace('/\[\[([^:\]]+)\]\]/', "$1", $text); 
    $text = preg_replace('/\*?\s?\[\[([^\]]+)\]\]/', '', $text); 
    $text = preg_replace('/\*\s?\[([^\s]+)\s([^\]]+)\]/', "$2", $text); 
    $text = preg_replace('/\n(\*+\s?)/', '', $text); 
    $text = preg_replace('/\n{3,}/', "\n\n", $text); 
    $text = preg_replace('/<ref[^>]?>[^>]+>/', '', $text); 
    $text = preg_replace('/<cite[^>]?>[^>]+>/', '', $text); 

    $text = preg_replace('/={2,}/', '', $text); 
    $text = preg_replace('/{?class="[^"]+"/', "", $text); 
    $text = preg_replace('/!?\s?width="[^"]+"/', "", $text); 
    $text = preg_replace('/!?\s?height="[^"]+"/', "", $text); 
    $text = preg_replace('/!?\s?style="[^"]+"/', "", $text); 
    $text = preg_replace('/!?\s?rowspan="[^"]+"/', "", $text); 
    $text = preg_replace('/!?\s?bgcolor="[^"]+"/', "", $text); 

    $text = trim($text); 

    $text = preg_replace('/\n\n/', "<br />\n<br />\n", $text); 
    $text = preg_replace('/\r\n\r\n/', "<br />\r\n<br />\r\n", $text); 
/* 
    $config = array(
     'show-body-only' => true, 
     'clean'   => false, 
     'wrap'   => 0, 
     'show-warnings' => 0, 
     'show-errors' => 0, 
     'enclose-block-text' => false, 
     'vertical-space' => true, 
     'output-html' => true 
    ); 

    // Tidy 
    $tidy = new tidy; 
    $tidy->parseString($text, $config, 'utf8'); 
    $tidy->cleanRepair(); 

    $text = $tidy->value; 
*/ 
    $extras = array(
    // "/\((.*?)\)/is" => "", 
     "/\[(.*?)\]/is" => "" 
    ); 
    $text = preg_replace(array_keys($extras), array_values($extras), $text); 

    $text = str_replace(" ,", ',', $text); 
    $text = str_replace(", ", ',', $text); 
    $text = str_replace(",", ', ', $text); 
    $text = str_replace("(, ", '(', $text); 
    $text = str_replace(";,", ',', $text); 

    // lets keep it plain plain plain 
    $text = strip_tags($text); 
// $text = preg_replace('/\s\s+/', ' ', $text); 

    $text = str_replace("|-", '', $text); 
    $text = str_replace("|}", '', $text); 
    $text = str_replace("|", '', $text); 
    $text = str_replace('()', '', $text); 
    $text = str_replace('&nbsp;', ' ', $text); 

    $text = trim($text); 

    $text_arr = preg_split('/[\r\n]+/', $text, -1, PREG_SPLIT_NO_EMPTY); 
    $result = array(); // must be an array: "[]" on a string is a fatal error in PHP 7.1+
    foreach ($text_arr as $paragraph) { 
     if (mb_strlen(trim($paragraph)) > 30) { 
     $result[] = $paragraph; 
     } 
    } 
    return $result; 
    } 
0

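This is really hard to do with a regex when only one example is given. From my own experience cleaning Wikipedia pages, I know that other pages will very likely look somewhat different. Matching just your example is simple: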

{{.+?}}\n 

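This only works if a line break follows the part to be removed, and only if you specify the DOTALL and MULTILINE flags. To match all pairs of double curly braces and whatever is inside them: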

{{[^}]+}} 
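As a rough illustration of how these two patterns behave in PHP, here is a minimal sketch. The `$sample` string is made up for the demonstration and is not taken from the page; DOTALL corresponds to PHP's `s` pattern modifier, and the braces are escaped to keep the patterns unambiguous.

```php
<?php
// Made-up sample: one multi-line template followed by a newline,
// and one inline template in the middle of a sentence.
$sample = "Intro.\n{{Infobox\n| a=1\n| b=2\n}}\nBody text.{{cite web |title=X}} End.";

// Pattern 1: lazy match up to the first "}}" that is followed by a newline.
// The "s" modifier (DOTALL) lets "." match line breaks inside the template.
$withNewline = preg_replace('/\{\{.+?\}\}\n/s', '', $sample);

// Pattern 2: any "{{ ... }}" pair whose content contains no "}".
// A negated character class matches newlines without any modifier.
$allPairs = preg_replace('/\{\{[^}]+\}\}/', '', $sample);

var_dump($withNewline); // the inline {{cite web ...}} is still there
var_dump($allPairs);    // both templates are gone
```

The first pattern leaves the inline citation alone because no line break follows it, which is the limitation described above.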

You could try doing several passes, each removing another unwanted part. I doubt it is really feasible to match everything you need in a single regex.
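One way the several-passes idea could look, sketched under assumptions: the helper name `strip_templates` is hypothetical, and the approach only covers `{{...}}` templates. Each pass removes the innermost templates (those containing no further braces), so nested ones such as the `{{cite book|...}}` entries inside `{{Historical populations ...}}` disappear from the inside out. A single run of `{{[^}]+}}` would instead stop at the first inner `}}` and leave the tail of the outer template behind.

```php
<?php
// Hypothetical helper: repeatedly strip innermost {{...}} templates
// until a pass finds nothing left to remove.
function strip_templates(string $text): string
{
    do {
        // [^{}]* only matches templates that contain no nested braces
        $stripped = preg_replace('/\{\{[^{}]*\}\}/', '', $text, -1, $count);
        if ($stripped === null) {
            return $text; // regex engine error: give back what we have
        }
        $text = $stripped;
    } while ($count > 0);

    return $text;
}

echo strip_templates("A {{outer |x={{inner}} |y=[[Link]]}} B"), "\n";
```

Unbalanced markup (an opening `{{` with no matching `}}`) is left untouched by this sketch, so real pages may still need further passes for links, tables, and references.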

+0

Running this first, in addition to the code below, works for that page. I'll test a few other pages, but it looks good. Thanks – Simon 2010-01-19 21:34:15

1

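Just guessing here, but wouldn't it be easier and safer to use Wikipedia's markup library (bundled with MediaWiki) to convert the markup to HTML and then parse that with whatever XML library you are familiar with?

The API documentation can be found at http://svn.wikimedia.org/doc/ (in the Parser module), and it doesn't look complicated. Basically, all you need to do is something like the following: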

<?php 

require_once '/path/to/mediawiki/Parser.php'; 
// also include whatever classes Parser depends on or use Mediawiki's autoload 
// mechanism if it has any 

// retrieve the content of your page in $content 

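// schematic call: in MediaWiki itself, Parser::parse() also expects a Title and
// a ParserOptions object and returns a ParserOutput, so adapt this to your version
$parser = new Parser(); 
$html = $parser->parse($content); 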

$simplexml = simplexml_load_string($html); 

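Now you have a very handy SimpleXML object to play with. Of course, this only works if MediaWiki's parser produces valid XML (I bet it does). Also, if MediaWiki includes some kind of autoload mechanism, you can easily find it by searching MediaWiki's codebase for __autoload or spl_autoload_register.

Hope it helps!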
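For the SimpleXML step on its own, a minimal sketch might look like this. It assumes the HTML is a well-formed fragment with a single root element; the helper name `html_paragraphs` and the sample markup are made up for illustration and do not come from MediaWiki.

```php
<?php
// Hypothetical helper: collect the plain text of every <p> element
// in a well-formed XHTML fragment.
function html_paragraphs(string $html): array
{
    $xml = simplexml_load_string($html);
    if ($xml === false) {
        return []; // input was not well-formed XML
    }

    $paragraphs = [];
    foreach ($xml->xpath('//p') as $p) {
        // asXML() keeps nested markup such as <a>; strip_tags() then drops the tags
        $text = trim(strip_tags($p->asXML()));
        if ($text !== '') {
            $paragraphs[] = $text;
        }
    }
    return $paragraphs;
}

// Made-up sample fragment: two paragraphs around a table that should be skipped.
$html = '<div><p>Tel Aviv is a <a href="/wiki/City">city</a> in Israel.</p>'
      . '<table><tr><td>skipped</td></tr></table><p>Second paragraph.</p></div>';
print_r(html_paragraphs($html));
```

Selecting only `<p>` elements skips tables and infoboxes without any regex work. Entities such as `&amp;` stay encoded in this sketch and would need `html_entity_decode()` if plain characters are wanted.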