2014-02-26

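Integrating Hive with Mahout recommendations

I want to use Mahout with Hive: fetch the data from Hive, populate a DataModel with it, and use Mahout to generate recommendations. Is this possible? As far as I can see, Mahout only works with files. 1) How can I load data from a Hive table into Mahout? 2) Is there any other way to use Mahout recommendations with Hive or something similar?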

Here I have the Hive JDBC results, and I want to load them into a Mahout DataModel. How do I populate it?

I want Mahout's recommendations to use the database results instead of reading from a file. For example:

Hive:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class HiveJdbcClient {
        private static String driverName = "org.apache.hive.jdbc.HiveDriver";

        /**
         * @param args
         * @throws SQLException
         */
        public static void main(String[] args) throws SQLException {
            try {
                Class.forName(driverName);
            } catch (ClassNotFoundException e) {
                // the Hive JDBC driver is not on the classpath
                e.printStackTrace();
                System.exit(1);
            }
            // replace "hive" here with the name of the user the queries should run as;
            // try-with-resources closes the connection and statement on exit
            try (Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement stmt = con.createStatement()) {
                String tableName = "testHiveDriverTable";
                stmt.execute("drop table if exists " + tableName);
                stmt.execute("create table " + tableName + " (key int, value string)");

                // show tables
                String sql = "show tables '" + tableName + "'";
                System.out.println("Running: " + sql);
                ResultSet res = stmt.executeQuery(sql);
                if (res.next()) {
                    System.out.println(res.getString(1));
                }

                // describe table
                sql = "describe " + tableName;
                System.out.println("Running: " + sql);
                res = stmt.executeQuery(sql);
                while (res.next()) {
                    System.out.println(res.getString(1) + "\t" + res.getString(2));
                }

                // load data into table
                // NOTE: filepath has to be local to the hive server
                // NOTE: /tmp/a.txt is a ctrl-A separated file with two fields per line
                String filepath = "/tmp/a.txt";
                sql = "load data local inpath '" + filepath + "' into table " + tableName;
                System.out.println("Running: " + sql);
                stmt.execute(sql);

                // select * query
                sql = "select * from " + tableName;
                System.out.println("Running: " + sql);
                res = stmt.executeQuery(sql);
                while (res.next()) {
                    System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
                }

                // regular hive query
                sql = "select count(1) from " + tableName;
                System.out.println("Running: " + sql);
                res = stmt.executeQuery(sql);
                while (res.next()) {
                    System.out.println(res.getString(1));
                }
            }
        }
    }

Mahout:

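import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

// Create a data source from the CSV file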
File userPreferencesFile = new File("data/dataset1.csv"); 
DataModel dataModel = new FileDataModel(userPreferencesFile); 

UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel); 
UserNeighborhood userNeighborhood = new NearestNUserNeighborhood(2, userSimilarity, dataModel); 

// Create a generic user based recommender with the dataModel, the userNeighborhood and the userSimilarity 
Recommender genericRecommender = new GenericUserBasedRecommender(dataModel, userNeighborhood, userSimilarity); 

// Recommend 5 items for each user 
for (LongPrimitiveIterator iterator = dataModel.getUserIDs(); iterator.hasNext();) 
{ 
    long userId = iterator.nextLong(); 

    // Generate a list of 5 recommendations for the user 
    List<RecommendedItem> itemRecommendations = genericRecommender.recommend(userId, 5); 

    System.out.format("User Id: %d%n", userId); 

    if (itemRecommendations.isEmpty()) 
    {
     System.out.println("No recommendations for this user."); 
    } 
    else 
    { 
     // Display the list of recommendations 
     for (RecommendedItem recommendedItem : itemRecommendations) 
     { 
      System.out.format("Recommended Item Id %d. Strength of the preference: %f%n", recommendedItem.getItemID(), recommendedItem.getValue());
     } 
    } 
} 

Answer

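Mahout version 0.9 provides data models (source data) for JDBC-compliant databases such as MySQL, Oracle and Postgres, for NoSQL databases such as MongoDB, HBase and Cassandra, and for the file system, which is the one you mentioned.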

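As of this release, Hive is not a 100% SQL-standard database, so the data models MySQLJDBCDataModel and SQL92JDBCDataModel are not suitable for Hive tables: Hive's SQL syntax differs from that of JDBC-compliant databases.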

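For the first question, you would probably need to extend AbstractJDBCDataModel and override the constructor so that it passes a Hive DataSource, together with the Hive-specific SQL queries for preference, preference time, user, all users and so on, to the AbstractJDBCDataModel constructor.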

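For the second question, if you are using the non-distributed algorithms (the Taste algorithms), the approach above stays the same. If you are using the distributed algorithms, Mahout can run on Hadoop and read the HDFS files that back the Hive table. See here for running Mahout on Hadoop.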
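For the non-distributed case there is also a file-based route: write the Hive query result out in the comma-separated `userID,itemID,value` layout that FileDataModel reads. A minimal sketch, assuming a hypothetical query that returns user ID, item ID and rating in columns 1 to 3; `HivePreferenceExport`, `toCsvLine` and `export` are illustrative names, not part of Mahout or Hive:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper: dumps (user, item, rating) rows to a CSV file.
class HivePreferenceExport {

    // Formats one preference as a "userID,itemID,value" line.
    static String toCsvLine(long userId, long itemId, float value) {
        return userId + "," + itemId + "," + value;
    }

    // Writes every remaining row of the result set to the target file and
    // returns the number of lines written. Assumes column 1 is the user ID,
    // column 2 the item ID and column 3 the rating.
    static int export(ResultSet rows, Path target) throws SQLException, IOException {
        int written = 0;
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            while (rows.next()) {
                out.write(toCsvLine(rows.getLong(1), rows.getLong(2), rows.getFloat(3)));
                out.newLine();
                written++;
            }
        }
        return written;
    }
}
```

The resulting file could then be handed to `new FileDataModel(new File(...))` exactly as in the Mahout snippet from the question. This trades an extra copy of the data for not having to write any Hive-specific DataModel.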

DataModel dataModel = new FileDataModel(userPreferencesFile); UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel); – techsivam

Can I do it like this? // read the data from Hive ... as in the example above DataModel dataModel = null; while (res.next()) { System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2)); dataModel = new GenericDataModel(); // how do I load the data? Is my understanding correct? } // use the DataModel with Mahout UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel); – techsivam

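Mahout's underlying library reads its data from the DataModel that is passed in, so the DataModel cannot be passed as null. And yes, GenericDataModel can also be extended, overriding the constructor to pass in the Hive-specific SQL queries.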
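The loop in the comment above would create a new model on every row. One way around that is to collect all rows first, group them by user, and build the model once after the loop. A rough sketch of the grouping step using only the JDK; `Row` and `groupByUser` are hypothetical names standing in for Mahout's preference types:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: groups flat (user, item, value) rows per user.
class PreferenceGrouping {

    // One row as it would come out of the Hive ResultSet.
    static final class Row {
        final long userId;
        final long itemId;
        final float value;

        Row(long userId, long itemId, float value) {
            this.userId = userId;
            this.itemId = itemId;
            this.value = value;
        }
    }

    // Collects the rows of each user into one list, keeping first-seen user order.
    static Map<Long, List<Row>> groupByUser(Iterable<Row> rows) {
        Map<Long, List<Row>> byUser = new LinkedHashMap<>();
        for (Row row : rows) {
            byUser.computeIfAbsent(row.userId, id -> new ArrayList<>()).add(row);
        }
        return byUser;
    }
}
```

As far as I know, in Mahout 0.9 each per-user list could then be converted to GenericPreference objects, wrapped in a GenericUserPreferenceArray and stored in a FastByIDMap<PreferenceArray>, which the GenericDataModel constructor accepts. That wiring is left out here so the sketch compiles without Mahout on the classpath.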