pyspark mapreduce dataframe

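# Average temperature per French station, highest averages first.
# Assumes each row is laid out by position as (station, country, temp).
# The chain is wrapped in parentheses because a backslash continuation
# cannot be followed by a comment.
(
    df.rdd
    .filter(lambda x: x[1] == "france")                    # only French stations
    .map(lambda x: (x[0], x[2]))                           # select station & temp
    .mapValues(lambda x: (x, 1))                           # pair each temp with a count of 1
    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))  # calculate sum & count
    .mapValues(lambda x: x[0] / x[1])                      # calculate average
    .sortBy(lambda x: x[1], ascending=False)               # sort by average, descending
    .take(100)                                             # keep the top 100
)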
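
The snippet carries a (sum, count) pair per station through the reduce step instead of averaging directly, presumably because averages of partial averages are not correct when groups differ in size, while sums and counts combine safely in any order. The sketch below reproduces the same steps in plain Python, without Spark, so the logic can be checked on a small list. The sample rows, the function name `average_temp_by_station`, and the (station, country, temp) column layout are assumptions made for illustration, not part of the original answer.

```python
from collections import defaultdict

# Hypothetical sample rows in the assumed (station, country, temp) layout.
rows = [
    ("paris", "france", 10.0),
    ("paris", "france", 14.0),
    ("lyon", "france", 20.0),
    ("berlin", "germany", 5.0),
]


def average_temp_by_station(rows, country="france", limit=100):
    """Plain-Python equivalent of the filter / map / reduceByKey / sortBy chain."""
    totals = defaultdict(lambda: (0.0, 0))  # station -> (running sum, running count)
    for station, row_country, temp in rows:
        if row_country != country:  # same role as the .filter step
            continue
        total, count = totals[station]
        totals[station] = (total + temp, count + 1)  # same role as reduceByKey
    # Divide sum by count only once, after all rows are combined.
    averages = [(station, total / count) for station, (total, count) in totals.items()]
    averages.sort(key=lambda pair: pair[1], reverse=True)  # highest average first
    return averages[:limit]


result = average_temp_by_station(rows)
print(result)
```

With the sample rows above, only the two French stations remain, and each average is computed from its own sum and count.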

Posted by: Guest on November-13-2020

