2016-12-05 84 views

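Testing Spark with pytest - cannot run Spark in local mode

I want to run the wordcount test from this site with pytest: Unit testing Apache Spark with py.test. The problem is that I cannot start the Spark context. This is the code I use to create the Spark context: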

@pytest.fixture(scope="session") 
def spark_context(request): 
    """ fixture for creating a spark context 
    Args: 
     request: pytest.FixtureRequest object 
    """ 
    conf = (SparkConf().setMaster("local[2]").setAppName("pytest-pyspark-local-testing")) 
    sc = SparkContext(conf=conf) 
    request.addfinalizer(lambda: sc.stop()) 

    quiet_py4j() 
    return sc 

I run this code with the following commands:

#first way 
pytest spark_context_fixture.py 

#second way 
python spark_context_fixture.py 

Output:

platform linux2 -- Python 2.7.5, pytest-3.0.4, py-1.4.31, pluggy-0.4.0 
rootdir: /home/mgr/test, inifile: 
collected 0 items 

Then I want to run the wordcount test with pytest.

pytestmark = pytest.mark.usefixtures("spark_context") 

def test_do_word_counts(spark_context): 
    """ test word counting 
    Args: 
     spark_context: test fixture SparkContext 
    """ 
    test_input = [ 
     ' hello spark ', 
     ' hello again spark spark' 
    ] 

    input_rdd = spark_context.parallelize(test_input, 1) 
    results = wordcount.do_word_counts(input_rdd) 

    expected_results = {'hello':2, 'spark':3, 'again':1} 
    assert results == expected_results 

But the output is:

________ ERROR at setup of test_do_word_counts _________ 
file /home/mgrabowski/test/wordcount_test.py, line 5 
    def test_do_word_counts(spark_context): 
E  fixture 'spark_context' not found 
>  available fixtures: cache, capfd, capsys, doctest_namespace, monkeypatch, pytestconfig, record_xml_property, recwarn, tmpdir, tmpdir_factory 
>  use 'pytest --fixtures [testpath]' for help on them. 

Does anyone know what causes this problem?

Do you have Spark installed on your machine? – Yaron

Yes, I have Spark 1.6 installed. I can run pyspark from the command line, so that part seems fine.

Answers

I did some research and finally found a solution. I use Spark 1.6.

First, I added two lines to my .bashrc file.

export SPARK_HOME=/usr/hdp/2.5.0.0-1245/spark 
export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
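
If editing .bashrc is not an option, the same path setup can be sketched inside Python, for example at the top of conftest.py before pyspark is imported. The helper name add_pyspark_to_path is hypothetical, and the layout ($SPARK_HOME/python plus a py4j-*-src.zip under python/lib) is assumed from the two exports above:

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home):
    """Put Spark's Python sources and its bundled py4j zip on sys.path.

    Returns the list of entries that were actually added.
    """
    python_dir = os.path.join(spark_home, "python")
    # The py4j version differs between Spark releases, so match it with a
    # glob instead of hard-coding py4j-0.9-src.zip.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    added = []
    for entry in [python_dir] + py4j_zips:
        if entry not in sys.path:
            sys.path.insert(0, entry)
            added.append(entry)
    return added
```

It would be called as add_pyspark_to_path(os.environ["SPARK_HOME"]), which still assumes that SPARK_HOME itself is set in the environment.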

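Then I created the file "conftest.py". The file name matters: pytest loads shared fixtures automatically only from files with exactly this name, so if you change it you will see the spark_context error again. If you run Spark in local mode and do not use YARN, conftest.py should look like this: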

import logging 
import pytest 

from pyspark import HiveContext 
from pyspark import SparkConf 
from pyspark import SparkContext 
from pyspark.streaming import StreamingContext 

def quiet_py4j(): 
    logger = logging.getLogger('py4j') 
    logger.setLevel(logging.WARN) 

@pytest.fixture(scope="session") 
def spark_context(request): 
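    conf = (SparkConf().setMaster("local[2]").setAppName("pytest-pyspark-local-testing")) 
    sc = SparkContext(conf=conf) 
    # Register the teardown only after sc exists, so a failed startup 
    # does not end in a NameError inside the finalizer. 
    request.addfinalizer(lambda: sc.stop()) 

    quiet_py4j() 
    return sc 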

@pytest.fixture(scope="session") 
def hive_context(spark_context): 
    return HiveContext(spark_context) 

@pytest.fixture(scope="session") 
def streaming_context(spark_context): 
    return StreamingContext(spark_context, 1) 

Now you can run the tests with a plain pytest command. pytest should start Spark and shut it down again when the session ends.
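
The two outputs in the question are consistent with how pytest finds things. By default it collects only functions whose names start with test, so a file that holds just the spark_context fixture yields "collected 0 items". It also looks for fixtures only in the test module itself, in conftest.py files, and in plugins, so a fixture defined in spark_context_fixture.py stays invisible to wordcount_test.py. A rough sketch of the first rule follows; the helper collectable_test_names is hypothetical, and pytest's real collection also handles classes and configurable patterns:

```python
def collectable_test_names(namespace):
    """Return names of callables that match pytest's default "test" prefix."""
    return sorted(
        name for name, obj in namespace.items()
        if callable(obj) and name.startswith("test")
    )


# Stand-ins for the two modules from the question (function bodies omitted).
fixture_module = {"spark_context": lambda request: None}
test_module = {"test_do_word_counts": lambda spark_context: None}

print(collectable_test_names(fixture_module))  # the fixture file has no tests
print(collectable_test_names(test_module))
```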

If you use YARN, you can change conftest.py to:

import logging 
import pytest 

from pyspark import HiveContext 
from pyspark import SparkConf 
from pyspark import SparkContext 
from pyspark.streaming import StreamingContext 

def quiet_py4j(): 
    """ turn down spark logging for the test context """ 
    logger = logging.getLogger('py4j') 
    logger.setLevel(logging.WARN) 

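# The marks inside params use the pytest 3.0 syntax; newer pytest releases 
# express this as pytest.param('local', marks=pytest.mark.spark_local). 
@pytest.fixture(scope="session", 
      params=[pytest.mark.spark_local('local'), 
        pytest.mark.spark_yarn('yarn')]) 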
def spark_context(request): 
    if request.param == 'local': 
     conf = (SparkConf() 
       .setMaster("local[2]") 
       .setAppName("pytest-pyspark-local-testing") 
       ) 
    elif request.param == 'yarn': 
     conf = (SparkConf() 
       .setMaster("yarn-client") 
       .setAppName("pytest-pyspark-yarn-testing") 
       .set("spark.executor.memory", "1g") 
       .set("spark.executor.instances", 2) 
       ) 
    sc = SparkContext(conf=conf) 
    # Register the teardown only after sc exists. 
    request.addfinalizer(lambda: sc.stop()) 
    return sc 

@pytest.fixture(scope="session") 
def hive_context(spark_context): 
    return HiveContext(spark_context) 

@pytest.fixture(scope="session") 
def streaming_context(spark_context): 
    return StreamingContext(spark_context, 1) 

Now you can run the tests in local mode by calling py.test -m spark_local and in YARN mode by calling py.test -m spark_yarn.

Wordcount example

Create three files in the same folder: conftest.py (shown above), wordcount.py:

def do_word_counts(lines): 
    counts = (lines.flatMap(lambda x: x.split()) 
        .map(lambda x: (x, 1)) 
        .reduceByKey(lambda x, y: x+y) 
      ) 
    results = {word: count for word, count in counts.collect()} 
    return results 

And wordcount_test.py:

import pytest 
import wordcount 

pytestmark = pytest.mark.usefixtures("spark_context") 

def test_do_word_counts(spark_context): 
    test_input = [ 
     ' hello spark ', 
     ' hello again spark spark' 
    ] 

    input_rdd = spark_context.parallelize(test_input, 1) 
    results = wordcount.do_word_counts(input_rdd) 

    expected_results = {'hello':2, 'spark':3, 'again':1} 
    assert results == expected_results 

Now you can run the tests by calling pytest.
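
To sanity-check the expected result in wordcount_test.py without starting Spark, the same counting can be sketched in plain Python. The function do_word_counts_local is a hypothetical stand-in, not part of the original files; it mirrors the flatMap / map / reduceByKey pipeline on an ordinary list:

```python
from collections import Counter


def do_word_counts_local(lines):
    """Count words in an iterable of strings, mirroring the RDD pipeline."""
    counts = Counter()
    for line in lines:
        # str.split() with no argument discards leading, trailing and repeated
        # whitespace, which is what flatMap(lambda x: x.split()) relies on too.
        counts.update(line.split())
    return dict(counts)


test_input = [
    ' hello spark ',
    ' hello again spark spark',
]
print(do_word_counts_local(test_input))
```

If this version and the Spark version disagree on some input, the difference is more likely in the Spark setup than in the counting logic.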

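This is great, thank you. One question: if I have a larger project and want to organize my Spark tests in several folders, how do I get conftest.py to work, since it seems to matter that it sits in the same folder?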
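
On the question about several folders: pytest makes the fixtures from a conftest.py available to tests in that directory and in every directory below it, so a single conftest.py at the project root is usually enough. The lookup can be pictured with the simplified sketch below; the helper conftest_candidates and the folder layout in the usage note are invented for illustration, and this is not pytest's actual implementation:

```python
import os


def conftest_candidates(test_file, rootdir):
    """List the conftest.py paths between rootdir and the test file's folder.

    The result is ordered from rootdir (outermost) down to the directory
    that contains the test file.
    """
    rootdir = os.path.abspath(rootdir)
    current = os.path.dirname(os.path.abspath(test_file))
    paths = []
    while True:
        paths.append(os.path.join(current, "conftest.py"))
        parent = os.path.dirname(current)
        # Stop at rootdir, or at the filesystem root if rootdir is never reached.
        if current == rootdir or parent == current:
            break
        current = parent
    return list(reversed(paths))
```

For a test at project/tests/unit/wordcount_test.py with project as rootdir, this lists project/conftest.py, project/tests/conftest.py and project/tests/unit/conftest.py; a spark_context fixture defined in any of those that exist would be visible to the test.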
