Extracting a category tree from Wikipedia using Java

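I want to use the Wikipedia API sandbox to extract the entire category tree under the root node "Economics". I don't need the article content, only a few basic details such as pageid, title and (at a later stage of my work) the revision history. So far I can extract the tree level by level, but what I want is a recursive or iterative function that does it. Each category contains subcategories and articles (just as each root contains nodes and leaves). I wrote code that extracts the first level into files: one file holds the articles, and a second file holds the names of the categories (the daughters of the root, which can be subdivided further). Then I went down a level and used similar code to extract their categories, articles and subcategories. The code stays much the same at every level, but it does not scale. I need to reach the lowest leaf under every node, so I need a recursion that keeps checking until the end. I prefix the files that contain categories with 'c_' so that I can apply a condition while extracting the different levels. Now, for some reason, it gets stuck and keeps adding the same things over and over. I need a way out of that loop.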

package wikiCrawl;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.Scanner;
import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;


public class SubCrawl 
{ 
public static void main(String[] args) throws IOException, InterruptedException, JSONException 
{ File file = new File("C:/Users/User/Desktop/Root/Economics_2.txt"); 
    crawlfile(file);  
} 

public static void crawlfile(File food) throws JSONException, IOException ,InterruptedException 
{   
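    // Read the category titles. Lines written by this program look like
    // "pageid;Category:Title"; everything after the first ':' is kept as the title.
    ArrayList<String> cat_list = new ArrayList<String>();
    try (Scanner scanner_cat = new Scanner(food)) // closed automatically, even on recursion
    {
      scanner_cat.useDelimiter("\n");
      while (scanner_cat.hasNext())
      {
       String scan_n = scanner_cat.next();
       if (scan_n.indexOf(":") > -1)
        cat_list.add(scan_n.substring(scan_n.indexOf(":") + 1).trim());
      }
    }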

      System.out.println(cat_list); 

      // Fetch the members (articles and subcategories) of each category
      URL category_json;
      for (int i_cat=0; i_cat<cat_list.size();i_cat++)
      {
       category_json = new URL("https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3A"+cat_list.get(i_cat).trim().replaceAll(" ", "%20")+"&cmlimit=500"); // trim() strips leading and trailing whitespace, e.g. a stray '\r'
       System.out.println(category_json);
       // Open one connection and read the response through it
       HttpURLConnection urlConnection = (HttpURLConnection) category_json.openConnection();
       BufferedReader reader = new BufferedReader (new InputStreamReader(urlConnection.getInputStream()));

       String line;
       String diff = "";
       while ((line = reader.readLine()) != null)
       {
        System.out.println(line);
        diff=diff+line;
       }
       reader.close();
       urlConnection.disconnect();

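       // Parse the whole response and take query.categorymembers (an empty list parses too)
       JSONArray jsonarray_cat = new JSONObject(diff).getJSONObject("query").getJSONArray("categorymembers");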
       System.out.println(jsonarray_cat); 
       //Loop categories 
       for (int i_url = 0; i_url<jsonarray_cat.length();i_url++) //jSONarray is an array of json objects, we are looping through each object 
       { 

        // Read the page id and title of this category member
        JSONObject member = jsonarray_cat.getJSONObject(i_url);
        int pageid = member.getInt("pageid"); // pageid is a JSON number, not a string
        System.out.println(pageid);
        String title = member.getString("title");
        System.out.println(title);

        // Articles go to <category>.txt, subcategories go to c_<category>.txt
        String base = cat_list.get(i_cat).trim().replaceAll(" ", "_");
        File food_year= new File("C:/Users/User/Desktop/Root/"+base+".txt");
        File food_year2= new File("C:/Users/User/Desktop/Root/c_"+base+".txt");
        food_year.createNewFile();
        food_year2.createNewFile();

        boolean isCategory = title.startsWith("Category:");
        // Append the entry and close the writer right away so no file handle stays open
        try (BufferedWriter writer = new BufferedWriter (new OutputStreamWriter(new FileOutputStream(isCategory ? food_year2 : food_year, true))))
        {
         writer.write(pageid+";"+title);
         writer.newLine();
        }

        if (isCategory)
        {
         // Recurse into the c_ file that was just appended to; the whole file is re-read on every call
         crawlfile(food_year2);
        }
       } 
      } 
     }

}

Source

2016-06-09 user6446052

For starters, this may be too great a demand on the Wikimedia servers. There are over a million categories (https://stats.wikimedia.org/EN/TablesWikipediaEN.htm#namespaces), and you need to read https://en.wikipedia.org/wiki/Wikipedia:Database_download#Why_not_just_retrieve_data_from_wikipedia.org_at_runtime.3F –

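For starters, this may be too great a demand on the Wikimedia servers. There are over a million categories (see the namespace statistics at https://stats.wikimedia.org/EN/TablesWikipediaEN.htm#namespaces), and you need to read "Wikipedia:Database download - Why not just retrieve data from wikipedia.org at runtime?". You need to limit your usage to about one request per second or risk being blocked. At that rate it would take around 11 days to fetch the full tree.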
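As a rough check of that estimate: one million requests at one per second is 1,000,000 / 86,400 ≈ 11.6 days. The sketch below does the same arithmetic and shows one simple way to keep requests at least a second apart. The class and method names (`CrawlBudget`, `estimate`, `remainingWait`) are made up for illustration, and the one-request-per-second figure is just the number quoted above, not an official limit.

```java
import java.time.Duration;

final class CrawlBudget {

    /** Time needed to issue `requests` calls at `perSecond` calls per second. */
    static Duration estimate(long requests, double perSecond) {
        return Duration.ofMillis(Math.round(requests / perSecond * 1000.0));
    }

    /** Milliseconds still to wait so that two calls are at least minIntervalMs apart. */
    static long remainingWait(long lastCallMs, long nowMs, long minIntervalMs) {
        return Math.max(0L, minIntervalMs - (nowMs - lastCallMs));
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumed workload: one request per category, one million categories.
        Duration total = estimate(1_000_000L, 1.0);
        System.out.println(total.toDays() + " days, " + (total.toHours() % 24) + " hours");

        // Pace three placeholder "requests" one second apart (no network call is made here).
        long last = System.currentTimeMillis();
        for (int i = 0; i < 3; i++) {
            Thread.sleep(remainingWait(last, System.currentTimeMillis(), 1000L));
            last = System.currentTimeMillis();
            System.out.println("request " + i + " would be sent now");
        }
    }
}
```

In a real crawler the `Thread.sleep` call would sit immediately before each HTTP request.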

It would be better to use the standard dumps from https://dumps.wikimedia.org/enwiki/. These are easier to read and process, and they do not put a heavy load on the servers.

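Better still is to get a Wikimedia Labs account, which lets you run queries against a replica of the database servers, or run scripts against the dumps, without having to download some very large files.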

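To get just the economics categories, the easiest route is through https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Economics, which has 1242 categories. You may find it easier to start from that category list and build the tree from there.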
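A minimal sketch of "building the tree from a list", assuming you already have two things: the set of category names from the project list, and a list of (parent, child) pairs obtained from somewhere else (the API or a dump). Both inputs and the name `CategoryTreeBuilder` are hypothetical; the project page itself only supplies the names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class CategoryTreeBuilder {

    /**
     * Groups (parent, child) pairs into a parent -> children map, keeping only
     * pairs whose two ends both appear in the known category list.
     * Each pair is a two-element array {parent, child} (assumed input format).
     */
    static Map<String, List<String>> build(Set<String> known, List<String[]> parentChildPairs) {
        Map<String, List<String>> children = new TreeMap<>();
        for (String[] pair : parentChildPairs) {
            String parent = pair[0];
            String child = pair[1];
            // Pairs that lead outside the project list are dropped,
            // which keeps the result confined to the subject area.
            if (known.contains(parent) && known.contains(child)) {
                children.computeIfAbsent(parent, k -> new ArrayList<>()).add(child);
            }
        }
        return children;
    }
}
```

Restricting the edges to a known list is what stops the result from wandering into unrelated parts of Wikipedia.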

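This would be better than the recursive approach. The problem with Wikipedia's category system is that it is not a true tree: it contains many loops. I would not be surprised if, by continuing to follow subcategories, you ended up with most of Wikipedia.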
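If you do walk the category graph yourself, the usual way to cope with loops is to remember every category already seen and to cap the depth. The following is only a sketch under assumptions: `subcategoriesOf` is a hypothetical lookup that you would back with the API or a dump, and it is passed in as a function so the walk can be tried on toy data without any network access.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class CategoryWalker {

    /**
     * Breadth-first walk over subcategories, starting at root and going
     * at most maxDepth levels down. Returns every category reached.
     */
    static Set<String> walk(String root, Function<String, List<String>> subcategoriesOf, int maxDepth) {
        Set<String> visited = new LinkedHashSet<>();
        visited.add(root);
        List<String> level = new ArrayList<>();
        level.add(root);
        for (int depth = 0; depth < maxDepth && !level.isEmpty(); depth++) {
            List<String> next = new ArrayList<>();
            for (String category : level) {
                for (String child : subcategoriesOf.apply(category)) {
                    // add() returns false for a category seen before,
                    // so a loop in the graph is never followed twice.
                    if (visited.add(child)) {
                        next.add(child);
                    }
                }
            }
            level = next;
        }
        return visited;
    }

    public static void main(String[] args) {
        // Toy data containing a loop back to the root.
        Map<String, List<String>> toy = new HashMap<>();
        toy.put("Economics", Arrays.asList("Microeconomics", "Macroeconomics"));
        toy.put("Microeconomics", Arrays.asList("Economics"));
        Set<String> seen = walk("Economics", c -> toy.getOrDefault(c, Collections.emptyList()), 10);
        System.out.println(seen);
    }
}
```

Each category is expanded at most once, so the walk terminates even when the graph has loops; the depth cap is a second guard against drifting far away from the root topic.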

Source

2016-06-10 13:26:47

Hey, I only need it for the root category "Economics" – user6446052

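I have updated the answer. For a project tied to a single subject area, it is easier to go through the relevant WikiProject. See my amended answer. –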

Can I use something like the API sandbox on the site https://tools.wmflabs.org? – user6446052
