Get Started in ~30 Seconds
DeclareData Fuse is a blazing fast drop-in alternative for running Spark code that operates without the need for high overhead and expensive Spark clusters. It is designed to run on your local machine as well as in production, providing the same APIs and functionality as Apache Spark.
Overview
Challenge
- - Spark is slow in production for ≤5TB on-demand data workloads
- - High overhead and complexity of managing Spark clusters
- - Spark code workflows and test/deploy iterations are too slow
- - High cloud costs for maintaining Spark clusters
Fuse Server vs Apache Spark?
- 🚀 Process terabyte-scale workflows faster locally and in production
- 💰 Cut cloud costs by 50%+ without compromising speed
- 🔥 Speed and simplicity without the overhead of Spark clusters
- 😄 Accelerate data engineering workflows, increase happiness
- ✨ Keep your existing Spark codebase
Key Components
- DeclareData Fuse Server: The blazing fast, simple drop-in Spark alternative
- Fuse Python: PySpark-compatible Python client library for DeclareData Fuse
- Fuse Go (experimental): Go (golang) client library for DeclareData Fuse
Server Setup
Run the latest DeclareData Fuse server using docker:
$
docker run -p 8080:8080 -p 3000:3000 ghcr.io/declaredata/fuse:latest
The server is now running and ready to accept connections on port
8080
and an optional web interface on port 3000
.
Client Installation
Use pip to install the DeclareData Fuse client:
$
pip install declaredata_fuse
Updating the DeclareData Fuse client:
$
pip install --upgrade declaredata_fuse
Usage by Example
Your PySpark code should work without modifications! No need to learn new APIs or refactor your codebase. See also official Github repository for more examples.
# Initialize a new session
from declaredata_fuse.session import FuseSession
# Connect to DeclareData Fuse Server
fs = FuseSession.builder.getOrCreate()
# Read CSV data
df = fs.read.csv("data.csv")
# Show initial data
print("Initial Dataset:")
df.show(5)
# Filter for a specific state
print("\nCalifornia Data:")
ca_data = df.where(df.state_abbr == "CA")
ca_data.show(5)
# Sort by population
print("\nStates Sorted by Population:")
sorted_data = df.sort("population", ascending=False)
sorted_data.show(5)
# Select specific columns
print("\nYear and State Only:")
year_state = df.select("year", "state_abbr")
year_state.show(5)
fs.stop()
# Initialize a new session
from declaredata_fuse.session import FuseSession
# Connect to DeclareData Fuse Server
fs = FuseSession.builder.getOrCreate()
# Use multiple datasets and sources
population_df = fs.read.csv("data.csv")
income_df = fs.read.csv("income.csv")
# Population Rankings
print("Top States by Population:")
population_df.sort("population", ascending=False).show(5)
# Year-by-Year Population Analysis
print("\nYear 2000 Population Data:")
pop_2000 = population_df.where(population_df.year == 2000)
pop_2000.sort("population", ascending=False).show(5)
print("\nYear 2001 Population Data:")
pop_2001 = population_df.where(population_df.year == 2001)
pop_2001.sort("population", ascending=False).show(5)
# State-Specific Analysis
print("\nCalifornia Population Trends:")
ca_pop = population_df.where(population_df.state_abbr == "CA")
ca_pop.sort("year").show(5)
print("\nTexas Population Trends:")
tx_pop = population_df.where(population_df.state_abbr == "TX")
tx_pop.sort("year").show(5)
# Major Population Centers
print("\nStates with Population > 20M:")
major_states = population_df.where(population_df.population >= 20000000)
major_states.sort("year").show(5)
fs.stop()
More Examples
Complex operations like window functions work seamlessly.
from declaredata_fuse.session import FuseSession
fs = FuseSession.builder.getOrCreate()
population_df = fs.read.csv("data.csv")
income_df = fs.read.csv("income.csv")
print("Population Data Sample:")
population_df.show(3)
print("\nUnique States in Dataset:")
states = population_df.select("state_abbr").distinct()
states.show(10)
print("\nLarge States (Population > 20M):")
large_states = population_df.where(population_df.population >= 20000000)
large_states.sort("population", ascending=False).show(5)
print("\nYear 2000 Population Rankings:")
y2000 = population_df.where(population_df.year == 2000)
y2000.sort("population", ascending=False).show(5)
print("\nTop States by Income and Population:")
combined = population_df.join(
income_df,
["state_abbr", "year"]
)
combined.sort("total_income", ascending=False).show(5)
print("\nHighest Population States by Year:")
high_pop = population_df.where(population_df.population >= 20000000)
high_pop.sort("year").show(5)
fs.stop()
Other complex operations like joins and filters work seamlessly.
from declaredata_fuse.session import FuseSession
import declaredata_fuse.functions as F
from declaredata_fuse.window import Window
fs = FuseSession.builder.getOrCreate()
taxi_df = fs.read.csv("taxi.csv")
weather_df = fs.read.csv("weather.csv")
print("Initial Taxi Data:")
taxi_df.show(5)
print("\nWeather Conditions:")
weather_df.show(5)
print("\nHighest Fares by Date:")
taxi_df.groupBy("date").agg(
F.first("fare_amount").alias("highest_fare_of_day")
).sort(
"highest_fare_of_day", ascending=False
).show(5)
print("\nTrips During Snow:")
combined = taxi_df.join(
weather_df, ["date"]
)
snow_trips = combined.where(combined.condition == "Snow")
snow_trips.sort("fare_amount", ascending=False).show(5)
print("\nFare Rankings by Weather Condition:")
window_spec = Window.partitionBy("condition").orderBy(F.desc("fare_amount"))
combined.withColumn(
"fare_rank", F.rank().over(window_spec)
).where(
F.col("fare_rank") <= 3
).show(9)
fs.stop()
Deploy
Deploying your code is as simple as running it locally. WIP.
TODO: Deployment instructions, coming soon.