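I found an interesting dataset on the internet. Judging by the occupation and committee names in the output below, it appears to record political contributions from people in science and engineering occupations rather than federal research funding, with amounts expressed in 2016 dollars. The total across all rows: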
scala> dfs.agg(sum("2016_dollars").alias("total_usd")).withColumn("USD",$"total_usd".cast("bigint")).show()
+-------------------+---------+
|          total_usd|      USD|
+-------------------+---------+
|1.403136673080022E8|140313667|
+-------------------+---------+
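The total above prints in scientific notation, and the `bigint` cast drops the cents (140313667 instead of 140313667.3080022). A minimal sketch of two alternatives, assuming a small made-up DataFrame named `sample` as a stand-in for `dfs`; the amounts and the column names `usd_exact` and `usd_pretty` are illustrative only. In spark-shell it can be entered with `:paste`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{format_number, sum}
import org.apache.spark.sql.types.DecimalType

// getOrCreate() reuses the existing session when run inside spark-shell.
val spark = SparkSession.builder().master("local[*]").appName("format-total").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for dfs: three made-up amounts under the same column name.
val sample = Seq(56210328.58, 12404909.46, 9626102.92).toDF("2016_dollars")

val total = sample.agg(sum("2016_dollars").alias("total_usd"))

val formatted = total
  // Fixed-point decimal with two fractional digits: keeps the cents.
  .withColumn("usd_exact", $"total_usd".cast(DecimalType(18, 2)))
  // Display string with thousands separators and two decimals.
  .withColumn("usd_pretty", format_number($"total_usd", 2))

formatted.show(false)
```

The decimal column stays numeric and can be used in further arithmetic, while `format_number` returns a string meant only for display.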
The totals by occupation:
scala> dfs.groupBy("cleanedoccupation").agg(sum("2016_dollars").alias("total_usd")).orderBy(desc("total_usd")).show(50,false)
+--------------------------+--------------------+
|cleanedoccupation         |total_usd           |
+--------------------------+--------------------+
|ENGINEER                  |5.6210328579001434E7|
|ENGINEER (SOFTWARE)       |1.2404909456000067E7|
|SCIENTIST                 |9626102.924999982   |
|GEOLOGIST                 |4464040.507000001   |
|CIVIL ENGINEER            |4310870.640999997   |
|PHYSICIST                 |3884441.206999995   |
|ELECTRICAL ENGINEER       |1872780.3499999999  |
|ENGINEER PETROLEUM        |1801395.8540000003  |
|RESEARCH SCIENTIST        |1465032.3850000012  |
|(CHEMIST)                 |1377201.1840000015  |
|ENGINEERING MANAGER       |984317.9560000005   |
|COMPUTER SCIENTIST        |971759.3170000003   |
|STATISTICIAN              |935231.7600000005   |
|ENGINEER SYSTEMS          |923817.563000001    |
|ENGINEER MECHANICAL       |893327.1470000002   |
|CONSULTING ENGINEER       |891420.4819999998   |
|PHYSICIAN & SCIENTIST     |778114.7870000006   |
|ENGINEERING               |774313.1520000008   |
|MATHEMATICIAN             |702732.044          |
|CHEMICAL ENGINEER         |698136.0830000002   |
|GEO PHYSICIST             |676961.9880000001   |
|GEOLOGIST - PETROLEUM     |518072.99400000006  |
|COMPUTER ENGINEER         |505745.18100000016  |
|BIO-CHEMIST               |487593.8659999999   |
|PROFESSIONAL ENGINEER     |471818.522          |
|ENGINEER - SALES          |465968.792          |
|ENGINEER NETWORK          |434904.4880000003   |
|ENGINEER STRUCTURAL       |380760.5010000001   |
|S.W. ENGINEER             |349827.36100000015  |
|ENGINEER - MINING ENGINEER|348709.9329999999   |
|CONSULTANT ENGINEERING    |336596.85199999996  |
|ENGINEERING V.P.          |333680.348          |
|CHEMISTRY PROFESSOR       |324671.82800000004  |
|ENGINEER ( RETIRED )      |315816.797          |
|CHIEF SCIENTIST           |296670.8009999999   |
|MEDICAL PHYSICIST         |276801.33799999993  |
|ENGINEER - ENVIRONMENTAL  |269114.5669999999   |
|PROFESSOR OF PHYSICS      |267624.814          |
|''DOMESTIC ENGINEER!''    |256155.36599999998  |
|ENGINEER & MANAGER        |242135.45699999997  |
|AEROSPACE ENGINEER        |238892.678          |
|ENGINEER & PRESIDENT      |235359.60199999998  |
|ENGINEER OWNER            |235189.19400000002  |
|ENVIRONMENTAL /SCIENTIST  |233449.48299999995  |
|RESEARCH ENGINEER         |225008.49199999997  |
|ENGINEER & PRINCIPAL      |218452.696          |
|ASTRONOMER                |213948.17700000003  |
|SYSTEM ENGINEER           |198212.448          |
|ENGINEER PROJECT          |181820.63699999996  |
|DIRECTOR ENGINEERING      |180311.91499999998  |
+--------------------------+--------------------+
only showing top 50 rows
The number of distinct occupation values:
scala> dfs.select("cleanedoccupation").distinct().count()
val res57: Long = 9266
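Many of the 9266 distinct values look like spelling variants of the same occupation, for example `ENGINEER (SOFTWARE)` or `''DOMESTIC ENGINEER!''` in the table above. A minimal sketch of one way to normalize the labels before grouping, assuming a tiny made-up DataFrame named `sample`: two of its labels are taken from the output above, the other two labels and all amounts are invented, and the column name `occupation_norm` is hypothetical. In spark-shell it can be entered with `:paste`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, desc, regexp_replace, sum, trim, upper}

// getOrCreate() reuses the existing session when run inside spark-shell.
val spark = SparkSession.builder().master("local[*]").appName("normalize-occupation").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for dfs with made-up amounts.
val sample = Seq(
  ("ENGINEER (SOFTWARE)", 10.0),
  ("ENGINEER - SOFTWARE", 5.0),
  ("''DOMESTIC ENGINEER!''", 2.0),
  ("DOMESTIC ENGINEER", 1.0)
).toDF("cleanedoccupation", "2016_dollars")

// Upper-case, turn punctuation into spaces, collapse runs of whitespace, trim.
val lettersOnly = regexp_replace(upper(col("cleanedoccupation")), "[^A-Z0-9 ]", " ")
val normalized = sample.withColumn(
  "occupation_norm",
  trim(regexp_replace(lettersOnly, "\\s+", " "))
)

val byOccupation = normalized
  .groupBy("occupation_norm")
  .agg(sum("2016_dollars").alias("total_usd"))
  .orderBy(desc("total_usd"))

byOccupation.show(false)
```

This only removes punctuation and spacing differences. Word-order variants such as `ENGINEER SOFTWARE` and `SOFTWARE ENGINEER` would still count as separate values.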
The totals grouped by committee name:
scala> dfs.groupBy("cmte_nm").agg(sum("2016_dollars").alias("total_usd")).orderBy(desc("total_usd")).show(false)
+------------------------------------------------------+--------------------+
|cmte_nm                                               |total_usd           |
+------------------------------------------------------+--------------------+
|OBAMA FOR AMERICA                                     |1.7751455284000028E7|
|HILLARY FOR AMERICA                                   |7670062.769999982   |
|REPUBLICAN NATIONAL COMMITTEE                         |7044682.358000069   |
|BERNIE 2016                                           |5947401.752000537   |
|DNC SERVICES CORPORATION/DEMOCRATIC NATIONAL COMMITTEE|4987350.171000016   |
|ROMNEY FOR PRESIDENT INC.                             |4380457.562999998   |
|DEMOCRATIC SENATORIAL CAMPAIGN COMMITTEE              |2962307.6000000057  |
|JOHN MCCAIN 2008 INC.                                 |2528159.003999992   |
|DEMOCRATIC CONGRESSIONAL CAMPAIGN COMMITTEE           |2420596.933         |
|HILLARY CLINTON FOR PRESIDENT                         |2150429.051999999   |
|CRUZ FOR PRESIDENT                                    |1856988.9939999753  |
|NATIONAL REPUBLICAN SENATORIAL COMMITTEE              |1711146.4410000006  |
|DCCC                                                  |1627752.27999995    |
|RON PAUL 2008 PRESIDENTIAL CAMPAIGN COMMITTEE         |1564211.8560000015  |
|RON PAUL 2012 PRESIDENTIAL CAMPAIGN COMMITTEE INC.    |1529315.4300000046  |
|DSCC                                                  |1449962.9319999968  |
|DONALD J. TRUMP FOR PRESIDENT, INC.                   |1291077.7140000004  |
|BILL FOSTER FOR CONGRESS COMMITTEE                    |1242041.2119999994  |
|NATIONAL REPUBLICAN CONGRESSIONAL COMMITTEE           |1085679.1910000003  |
|NRCC                                                  |699361.8919999996   |
+------------------------------------------------------+--------------------+
only showing top 20 rows
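Some committees appear to be listed under both a full name and an abbreviation, such as `DCCC` and `DEMOCRATIC CONGRESSIONAL CAMPAIGN COMMITTEE`, so their totals are split across two rows. A minimal sketch of merging them with a hand-written alias map, assuming a made-up DataFrame named `sample`: only the committee names come from the output above, the amounts are invented, and the `aliases` map is my own assumption about which names belong together. In spark-shell it can be entered with `:paste`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{desc, sum}

// getOrCreate() reuses the existing session when run inside spark-shell.
val spark = SparkSession.builder().master("local[*]").appName("committee-aliases").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for dfs with made-up amounts.
val sample = Seq(
  ("DCCC", 100.0),
  ("DEMOCRATIC CONGRESSIONAL CAMPAIGN COMMITTEE", 200.0),
  ("NRCC", 50.0),
  ("NATIONAL REPUBLICAN CONGRESSIONAL COMMITTEE", 75.0)
).toDF("cmte_nm", "2016_dollars")

// Assumed abbreviation -> full name pairs; extend or correct as needed.
val aliases = Map(
  "DCCC" -> "DEMOCRATIC CONGRESSIONAL CAMPAIGN COMMITTEE",
  "DSCC" -> "DEMOCRATIC SENATORIAL CAMPAIGN COMMITTEE",
  "NRCC" -> "NATIONAL REPUBLICAN CONGRESSIONAL COMMITTEE"
)

// na.replace rewrites exact matches in the named column; other values pass through.
val merged = sample.na.replace("cmte_nm", aliases)
  .groupBy("cmte_nm")
  .agg(sum("2016_dollars").alias("total_usd"))
  .orderBy(desc("total_usd"))

merged.show(false)
```

If the dataset carries a stable committee identifier column, grouping by that would be more reliable than matching names by hand.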
The dataset is public and free to download. I analysed it with Apache Spark just for fun.
Generated on 09/29/22