Progress: Meeting Minutes and Deadlines
see Trello board at https://trello.com/b/dz7GcoDM/graph-zeppelin-board
- Kenny (& Ahmed): Streamify a graph with ~100k nodes and run continuous correctness tests on them.
- Victor: Keep working on post-processing investigation. Be on call for Java questions from Abi.
- Abi: Schedule meeting with David, Tyler to talk about running on the AWS cluster.
- Evan: Run gutter tree experiments with new granularity for NVMe.
- All: writing!
- Victor: Continue running experiments.
- Ahmed/Evan: Continue running NVMe experiments.
- Abi: Continue prodding at GraphX on the school cluster.
- Victor: troubleshoot continuous correctness tests.
- Ahmed/Evan: continue follow-up experiments (NVMe, no mem limit, parallel)
- Abi: Ensure graphx swaps to disk when it runs out of memory instead of crashing.
- Abi: Finish pre and postprocessing for novel distributed model.
- Tyler: Continue low-level C++ optimizations for buffers.
- Everyone: Schedule writing meeting with David. Harass him relentlessly and without mercy.
- Kenny: Continue correctness experiments on kron data.
- Kenny/Ahmed: Find standard (feasible) real world graphs to run correctness tests on.
- Ahmed/Evan: Rerun swapping experiments on NVMe.
- Abi: Ensure graphx swaps to disk when it runs out of memory instead of crashing.
- Abi: Finish pre and postprocessing for novel distributed model.
- Tyler: Continue low-level C++ optimizations for buffers.
- Victor/Evan: Verify that multi-threading post-processing is being used correctly.
- Everyone: Schedule writing meeting with David. Harass him relentlessly and without mercy.
- Everyone: writing!
- Abi: merge Victor's PR and run on cluster
- Abi: Work with Tyler on GraphX
- Kenny: Push bucket repetitions to ~.4 and rerun
- Ahmed: Verify Aspen memory usage against the paper
- Find stuff to delete on bigboy so Ahmed can work on large graphs
- Abi: OpenMPI implementation with machine queue
- Ahmed: Run Aspen on larger graphs to see how Aspen deals with large data.
- Ahmed: Generate Erdos-Renyi graphs with ~300k nodes
- Kenny: Keep investigating repetitions
- Evan: Look at shuffling graph streams more
- Victor: Test and tune parallel post-processing
- Victor: Anonymized repo with run instructions for kron13
- Abi: Rework OpenMPI setup to have workers request work from master node; investigate potential compiler issues
- Ahmed: Draft email to aspen authors, convert kron17 and kron18 for memory footprint tool
- Ahmed, Evan, Tyler, David: Take a closer look at data set generation.
- Abi: Look at Tyler's existing code and try Open MPI for making our system distributed.
- Tyler: Switch to AWS lambda
- David: Try to use NSF funding for AWS credits
- Evan & David: Meet with Michael to discuss Gutter Tree performance
- Ahmed: Suggest fiddling to do with Aspen's setup
- David & Victor: Checkin around Ahmed's adjacency list generation format for the memory footprint tool
- Abi: Help out with Aspen stuff
- Everyone: Keep iterating on the writing - Check spreadsheet (Tasks page redesign by David)
- Abi: Figure out how to run Spark with Slurm.
- Abi: Automate cluster setup with Ansible
- Ahmed: Have a working executable to run the connected components subroutines of Terrace and Aspen.
- Victor: Efficient node-level synchronization
- Victor: In-memory buffering
- Evan, Victor: Merge sync, circular linkedlist code
- Evan: Basement nodes
- Kenny: Correctness tests at scale (?)
- Abi: Try running our main branch on Mike Ferdman's cluster, and also a toy PySpark distributed program to see how it scales with more nodes.
- Ahmed: Run stress-tests on lab machine.
- Victor/Evan: Figure out what multithreading framework to use for the buffer tree and whether we want to use different sizes for nodes in the buffer tree.
- Kenny, Evan, Victor: (low priority) rerun old tests on new machine
- Kenny: Use linux sort for edge-deduplication and write an on disk spanning forest verifier.
- Tyler: Rebase; replicate Kenny's experiment on new lab machine; compare results to Spark implementation
- Abi: Figure out Ansible stuff; run main branch on cluster; download third data set.
- Evan: unify optimizations into a single prototype
- Victor: N/A
- Ahmed: Push main branch to its limit using Graph500 generation; report what breaks first.
- Kenny: Graph and Stream Generation prototyping.
- Tyler: Still working on Spark prototype.
- Abi: Downloaded data, but the graph is very sparse. Need to find dense graphs.
- Abi: get cluster access set up.
- Evan: Get either OMP or std::thread to work with the prototype implementation
- Victor: Get asynchronous buffer tree updates working
- Ahmed: Think about experiment design and workloads. How can we test skeletons etc. Check the tests written up in the paper.
- Kenny: Done with optimizations. Look at how to manage tests when the graph doesn't fit in memory.
- Tyler: Still working on Spark prototype.
- Abi: Downloaded data, but the graph is very sparse. Need to find dense graphs.
- Evan: Get OpenMP to work with the buffer tree.
- Victor: Buffer tree debugging stuff to do.
- Ahmed: Look at evaluation section of overleaf.
- Ahmed: Look into how to adapt Kronecker exponentiation for our use case, if there's something we can do maybe work out a demo
- Ahmed: Literature review of semi-external memory systems
- Ahmed: Finish experiments with k-connectivity
- Kenny: Implement 32 bit checksum with vector of 'a's and vector of 'c's, ditching buckets because of struct padding
- Evan + Victor: Implement optimizations for prototype (openmp for graph workers, pull in sketch level parallelism, use half repetitions)
- Abi: Deal with dataset, start analyzing
- Abi: Look into transformations to make dense graphs with properties of original graph
- Abi: Send emails to get access to Spark cluster
- Tyler: Code up Spark prototype
- Ahmed: Run profiling experiments with proper k-fold stream ingestion
- Abi: Try downloading a larger part of the dataset and answer the same questions about the graph.
- Kenny: (Deadline pushed from Thursday) Try and take a look at delta idea. Cut down on the bucket size by trying with 32-bit hash.
- Victor & Evan: Finish FBT integration prototype.
- Evan: Bugfix buffer tree so tests pass.
- Ahmed: Run profiling experiments and investigate graph generation.
- Abi: If you remove all but one edge between each pair of nodes in the network data set, is it still a dense graph? What is the density of the graph?
- Evan: Try to get Victor's code to run. If successful, migrate from pthreads to OpenMP and benchmark result against Toku with all of our optimizations enabled.
- Kenny: Try and take a look at delta idea. Cut down on the bucket size by trying with 32-bit hash.
- Victor: Fix bug/linker error with FBT integration with Evan's help. Talk about some OpenMP-specific stuff with Abi.
- Ahmed: Fix bug in implementation and look at ways to generate dense graphs. Try to get plots showing there is a k-factor blowup.
- Abi: Talk with Tyler to get the SWIG stuff resolved, and look at the networks dataset and preprocess it to make sure it's not a multigraph, and is also dense.
- Victor: Fix bug/linker error with FBT integration with Evan's help. Talk about some OpenMP-specific stuff with Abi.
- Evan: Help Abi with distributed stuff
- Kenny: Try and take a look at delta idea. Cut down on the bucket size by trying with 32-bit hash.
- Ahmed: Test k-connectivity. Put up a PR for spanning forest.
- Kenny: Run continuous tests on lab machine. Write section 4.1.
- Victor: Integrate buffer tree into `xor_WODS` branch. Replace Toku.
- Abi: talk to Mike Ferdman about cluster
- Ahmed: k-connectivity tests, pr for spanning forest generation
- Kenny: continuous graph testing done by Monday if possible with multiple verifications along the way. Writing section 4.1
- Victor: pr OpenMP, post processing tests to verify working at large scale and start integrating external version. Writing section 4.2/3
- Evan: Synchronous Buffer Tree prototype and test entire system.
- Everyone: Discuss random graph generation at next meeting
- Abi: talk to Mike Ferdman about cluster
- Ahmed: k-connectivity
- Kenny: continuous graph testing. potentially buffer edge updates in test for multithreading.
- Victor: OpenMP staggering threads, talk with Abi
- Evan: finalize buffer tree
- Ahmed: get setup on lab machine; time/space profiling for bipartite testing; k-connectivity implementation; make dense graph unit test
- Kenny: Run sketch repetition sparsity experiments on graphs; sketch experiment where we query as we insert. E.g. count failures every 100 inserts
- Victor: Investigate Abi's suggestion of adding additional layer of parallelism: updates which are adjacent in memory are handled by the same thread
- Abi: contact Mike Ferdman regarding cluster; push a PR for serialization fixes; contact Tyler to fix remaining SWIG serialization issues
- Evan: keep working on the buffer tree
- Ahmed: k connected (set of k edges if removed break CC), Testing larger graphs w/ bipartite stuff, Verify that space and time blow up hold
- Kenny: Run sketch repetition sparsity experiments on graphs
- Kenny: Sketch experiment where we query as we insert. E.g. count failures every 100 inserts
- Victor: Re-run OpenMP experiments with full-length types and omp timer
- Evan: Multi-level buffer tree implementation
- Abi: Continue plugging away at SWIG with Tyler
- Abi + Victor: Push a PR for serialization fixes (by Tuesday)
- Ahmed: k connected (set of k edges if removed break CC), Testing larger graphs w/ bipartite stuff, Verify that space and time blow up hold
- Kenny: write this stuff down in the wiki. Need to figure out the best constants for the size of the data structure, so compare p-value versus failure rate.
- Victor: Fix Sketch-Level multi-threading implementation
- Evan: Keep working on buffer tree, Bug David if he hasn't sent me the information about the Stony Brook Cluster
- Abi: Create pull request for CMakeLists changes that fixed serialization issues
- Abi: Still bashing our heads trying to fix serialization issue with Boost.
- Victor: Can run into issues with OpenMP and currently having issues with serialization. Working with Abi to get this resolved.
- Ahmed: Ask Victor one question and implement code to count the number of connected components
- Kenny: Work on statistical tests and verify that they can be replicated/reproduced easily.
- Evan: continue with the Toku experiments and learn more.
- Alex: Getting VMware's code so we can use it.
- Victor: Implement sketch level parallelism
- Kenny: Figure out proper statistical sketch testing, set up meeting with Victor,David,Kenny
- Kenny: Million nodes, how much bucket repetition?
- Evan: Figure out weird multithreading stuff.
- Ahmed (with Victor as guide to codebase): get familiar for code base, begin work on bipartite algorithm
- Tyler and Abi: Can take next steps on serialization issues. Easiest is probably Boost, so they'll try that.
- Kenny: Optimizations - eliminate the duplication of the bucket which has everything. Also multithreading with Evan and Victor. If done with multithreading, finish up largeSketchTesting.
- Victor: Multithreading with Kenny and Evan. If done with multithreading, finish up largeSketchTesting.
- Evan: Multithreading with Kenny and Victor.
- Ahmed: Note down workloads used. Talk with David/Evan about external memory algorithms and the performance they were seeing. Potentially later: find real-world graphs that have our properties, or benchmarks that can be adjusted to our use case.
- Zhang gang: merge hash-xor into main branch.
- Kenny: repeat vectorized sketch update experiments on lab machine, verify that AVX512 is enabled, and devise "dummy" code (independent from sketching) to benchmark vectorization on a few basic functions just to see if a speed-up can be achieved.
- Victor: wrangle bugs and continue development of post-processing.
- Ahmed: contact David and formalize a proof of the happy accident that is our current xor-hashing.
- Abi: revisit "short" fix for serialization issues with Tyler (if possible), otherwise begin "long" fix of modifying Kenny and Victor's C++ code to work with SWIG.
- Evan: begin work on multi-threading updates.
- David, Ahmed: Determine how our model competes in the external memory world
- Kenny: Toy model (xor-ing multiple things together) using AVX2/AVX512
- David, Victor: Implement a first pass of I/O-efficient sketch querying
- Abi, Tyler: Nail down interop issues
- Evan: Take Victor's setup for limiting memory and apply to Toku.
- Ahmed: Figure out the lower bound for I/O complexity of the external graph algorithms explored in the survey paper.
- Kenny: Implement the suggestion based on trying fewer hash calls that should result in having log N less buckets.
- Abi/Tyler: Still working to integrate SWIG and Spark without issues.
- Victor: Chase down failing tests.
- David: Figure out how large inputs need to be for us to beat the competition.
- Kenny: Try making contains guesses not independent, 1 call for each set of levels.
- Evan/Victor: Verify xor hashing, make sure no weird bugs
- Victor: Change unit tests to efficient-gen, replacing binary GTgraph-random
- Evan: Do performance analysis to see if toku speedup is due to locality or batching
- Evan: Help Victor with buffer tree implementation
- Abi: Work on interop
- Kenny: think about how to aggressively vectorize sketch updates
- Ahmed: Figure out what current state of the art is, and what guarantees they give in terms of I/O complexity
- David: Redo guess of how long everything should take based on newer runtime/memory numbers
- Evan, Victor: Verify stream generation is correct. Re-run Toku tests with xor-hashing.
- Ahmed: Understand opposition. "Know thy enemy and know thyself" - Survey of external memory large‑scale graph processing on a multi‑core system
- Kenny: redo time profiling to remove overhead introduced by the timing tools, and communicate with Evan to remove the "b" component of the buckets.
- Evan: investigate the timing tools as a possible source of the inconsistencies between lab and virtual machine results of toku testing.
- Abi: consider Cython to resolve SWIG/Spark integration issue. May need to opt for a different language version of Spark.
- Kenny: Re-profile sketching. Modify bucket to remove `b`.
- Evan: Make sure everything is working correctly with Toku for the experiment (e.g., flushing works properly).
- Kenny: Double-check profiling numbers
- Victor: Getting out a skeleton of the homebrewed write-optimized data structure.
- Abi, Tyler: Get out the first prototype of spark system which interfaces with sketching code
- Evan: Toku vs naive experiment
- Ahmed: Present what we currently understand about the EM graph algorithms.
- Kenny: Profile performance when we use David's XOR sketching technique. Check what's causing the failure rate on the XOR technique.
- E: Handing off to Evan what is needed to benchmark our stuff vs Toku.
- Abi/Tyler: Have a SWIG-wrapped sketch object in Python, with the methods we need to update it exposed to Python.
Each main meeting, one student will note down the key points from the meeting: a brief summary of each person's progress, keywords or short sentences for topics discussed, etc. The minute-taker rotates weekly. Rotation (* indicates next minute-taker): | Abi | Evan | Victor | *Ahmed | Kenny |
- Kenny: Continuous testing looks good on random graphs; kron graphs. We should start running this with other graph types.
- Victor: No updates on post-processing itself. Working on the machinery to compare apples to apples wrt post-processing.
- Ahmed: Experiments are done (Aspen w/o memory restriction, Aspen on NVMe, Terrace on NVMe). These experiments look very good for us. Aspen is slower than us even without memory restriction, and its performance isn't too much better on NVMe.
- Abi: The swapping problem is still happening outside of spark. If a test RDD is small enough to fit in RAM, it's fine. But larger doesn't swap to disk.
- Abi: It seems that RDDs can't be larger than the size of collective memory, otherwise Spark complains. In particular, Spark `shuffle` is failing.
- Evan: It seems that block size is a leading cause of slowdown on NVMe. But things still look pretty slow with larger block sizes.
- David: It's odd that swapping on NVMe is not very good, since standard page sizes are smaller than the optimal page sizes of newer drives. This may be a cool research problem for the future.
- Distributed implementation
- Workers collecting work is done
- Victor's PR:
- Workers can apply deltas without access to graph data structure
- Master can apply deltas to canonical sketch
- Need to merge code, run on cluster, and see what happens
- GraphX
- Tyler might be able to get prototype working
- Repetitions
- 1/3 repetitions, 10k nodes, 49/1000 failures
- 1/3 repetitions, 50k nodes, 3/400 failures
- Good news: Increasing nodes decreases failure probability
- Bad news, still bad failure probability
- Try pushing to ~.4
- Aspen
- Goal: we know we're asymptotically better, and if we can find a practical example where we use less memory we're happy
- Original numbers for kron 17 were bad
- 7.3GB -> 6.711 GB
- Dataset: twitter graph should be a workable size
- Aspen paper number of edges is off from our dataset
- Maybe slight version differences?
- Memory footprint tool is being difficult, ditch it
- Compare to numbers in Aspen paper
- Multithreading buffer tree
- Questionable performance impacts; not sure if it's helping
- In memory buffers
- 1/4 size in memory buffer is same performance as full
- Not huge gap between RAM buffers and buffer tree
- Abi: There's some issues with locking on the worker nodes
- Evan: Which we may be able to fix by having all the work be sequential on each worker node. We can try to have a "machine queue" that keeps the status of the machines.
- Abi: But we can maybe cut some corners by having one thread send and receive messages from the workers.
- Ahmed: There are some explanations for why Aspen is being weird. Either Aspen's just really good at compression, we're not measuring things correctly, or we don't understand our own datasets. We can probably rule out the third one, since we've run graphs from the Aspen paper with the same results.
- Kenny: (wrt sketches) Ran third/quarter repetitions on larger graphs. Third repetitions failed appreciably when we ran ~10k nodes multiple times.
- Evan: Trying to shuffle the graph stream more.
- Ahmed: Why not just maintain a gigantic bitvector of edges?
- Victor: Parallel post-processing. Will need to test/tune it.
- Evan/Ahmed: Aspen's memory numbers were different as reported by our tools and their tools and we're unsure why. It seems like even on kron 18 we don't beat Aspen in memory usage. With Terrace we have no issues, they're easy to beat.
- Abi: Found a solution for worker nodes not communicating with master node. Will write the setup code to automate this fix.
- Abi: Need more work on how to use Spark with Slurm. Maybe Evan or Tyler can help?
- Ahmed: Looked at using in-memory systems Terrace and Aspen. It's horribly documented and the source is archaic and bad. Will have a running prototype coming soon.
- David: This will help us tell our story: In-memory things will be fast up until they can't fit into memory anymore. Then hopefully they will be slow and we will be fast.
- Victor: Installed memory controller on Little Man and Big Boy. Be advised that Little Man has an HDD and Big Boy has an SSD as swap file.
- Victor: If we have multithreaded stream ingestion we'll need some sort of synchronization. Looked at sketch-level sync with std::mutex.
- Evan: Try to decrease IO contention with a circular linkedlist of full leaves ("ripe fruit"). When the linkedlist is full, it blocks the buffer tree.
- Evan: It slows down the single-threaded buffer tree slightly, but it's still super fast.
- Victor: There's some discussion to be had about the size of mutexes in sketches. Each std::mutex is 40 bytes, which is pretty large all things considered.
- David: Is it? On a 10k vertex graph, this is a less than 1% space overhead. It may not be too consequential.
- Evan: But even still, in a distributed world we want node-level parallelism instead of sketch-level parallelism, so we might as well do that.
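Evan's "ripe fruit" idea above can be sketched as a bounded queue of full leaves: the buffer tree thread blocks once too many leaves are pending, throttling I/O until workers catch up. A minimal toy in Python (the names and the queue bound are illustrative, not the project's actual implementation):

```python
import queue
import threading

ripe = queue.Queue(maxsize=8)   # the "ripe fruit" list of full leaves

def buffer_tree_thread(leaves):
    for leaf in leaves:
        ripe.put(leaf)          # blocks while 8 leaves are pending
    ripe.put(None)              # sentinel: no more leaves

def sketch_worker(results):
    while True:
        leaf = ripe.get()
        if leaf is None:
            break
        results.append(sum(leaf))  # stand-in for applying sketch updates

results = []
t = threading.Thread(target=buffer_tree_thread,
                     args=([list(range(i)) for i in range(100)],))
w = threading.Thread(target=sketch_worker, args=(results,))
t.start(); w.start(); t.join(); w.join()
assert len(results) == 100
```

A real implementation would presumably use a fixed circular list rather than a locking queue, but the blocking behavior when the list fills is the same.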
- Abi: Try running our main branch on Mike Ferdman's cluster, and also a toy PySpark distributed program to see how it scales with more nodes.
- Ahmed: Run stress-tests on lab machine.
- Victor/Evan: Figure out what multithreading framework to use for the buffer tree and whether we want to use different sizes for nodes in the buffer tree. Architect experiments to be done at scale.
- Abi: Gained access to cluster
- Ahmed: see "Graph500 generation" wiki page
- Evan: survived semester
- Kenny: Linux sort is apparently an external memory sort (for large enough inputs).
- Kenny: will need to rearchitect tests for when the graph doesn't fit in memory.
- Ideas for verification:
- Sort the stream and remove insert/delete pairs
- Generate the graph from a hash function so we know what the final answer is and we get the graph, i.e. hash value of 0 = not present, 1 = inserted, 2 = inserted/deleted, etc.
- Split this into two pieces Graph and Stream generation. So first generate a graph and then generate a stream from that graph.
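The hash-based verification idea above can be sketched as follows: hash every candidate edge to decide whether it is absent, inserted, or inserted-then-deleted, so the final graph is known up front without ever storing the stream. A hedged toy (the 0/1/2 encoding follows the bullet above; all names are illustrative):

```python
import hashlib
import itertools

def edge_state(u, v, seed=0):
    """Deterministically map an edge to one of three states:
    0 = never inserted, 1 = inserted (present in the final graph),
    2 = inserted then deleted (absent)."""
    key = f"{seed}:{min(u, v)}:{max(u, v)}".encode()
    h = hashlib.sha256(key).digest()
    return int.from_bytes(h[:8], "big") % 3

def generate(n, seed=0):
    """Yield an update stream whose final edge set is known up front."""
    final_edges, stream = set(), []
    for u, v in itertools.combinations(range(n), 2):
        s = edge_state(u, v, seed)
        if s == 1:
            stream.append(("ins", u, v))
            final_edges.add((u, v))
        elif s == 2:
            stream.append(("ins", u, v))
            stream.append(("del", u, v))
    return stream, final_edges

# Replaying the stream must reproduce exactly the expected edge set.
stream, expected = generate(50)
replayed = set()
for op, u, v in stream:
    if op == "ins":
        replayed.add((u, v))
    else:
        replayed.discard((u, v))
assert replayed == expected
```

Because the generator is a pure function of (edge, seed), the verifier can recompute the expected answer independently, with no need to keep the stream on disk.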
- Victor and Evan
- Both having problems getting OpenMP to play nice as a background task.
- Could switch to using std::thread instead and see if std::thread will play nice with OpenMP.
- Kenny: Done with optimizations. Look at how to manage tests when the graph doesn't fit in memory.
- Tyler: Still working on Spark prototype.
- Abi: Downloaded data, but the graph is very sparse. Need to find dense graphs.
- Evan: Get OpenMP to work with the buffer tree.
- Victor: Buffer tree debugging stuff to do.
- Ahmed: Look at evaluation section of overleaf.
- Evan: New data on buffer tree, using normal repetitions, no shorter checksum
- Evan: We would like to incorporate sketch level parallelism with supernode level parallelism, might require delta sketch for avoiding concurrency issues
- Kenny: we now have batch update delta sketch implementation
- David: numbers to see in buffer tree performance table: updates per second for graphs as large as you can get
- Kenny: At first glance, the 32-bit hash for the checksum seems to work, but to actually get space savings the code may be messy (instead of a vector of buckets, a vector of 'a's and a vector of 'c's).
- Ahmed: Kronecker exponentiation increases nodes from n to n^2, might be too much for us
- Abi: Script to download the data set file by file. Around 1 TB, on the order of 70-90 hours to download. Not sure how many nodes are in the data set; estimate 1-2 million nodes.
- Discussion on how to process our multigraph. Combining multi-edges into one retains density but requires more processing. Toggling edges (odd count = in graph, even = not in graph) would not retain density, but we could feed the stream directly into our algorithm with extra processing. Graph complement? Transitive closure-esque stuff?
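The toggling option above can be sketched directly: every repeated occurrence of an edge flips its membership, so an edge with an odd multiplicity ends up in the final graph. A small illustrative example (not the project's actual preprocessing code):

```python
def toggle_stream(multi_edges):
    """Convert a multigraph edge list into an insert/delete stream.

    Each repeated occurrence of (u, v) toggles the edge's membership,
    so an edge with an odd total count is in the final graph and one
    with an even count is not.
    """
    present = set()
    stream = []
    for u, v in multi_edges:
        e = (min(u, v), max(u, v))
        if e in present:
            stream.append(("del", *e))
            present.discard(e)
        else:
            stream.append(("ins", *e))
            present.add(e)
    return stream, present

# (0,1) appears 3 times (odd) -> survives; (1,2) appears twice -> drops.
edges = [(0, 1), (0, 1), (0, 1), (1, 2), (1, 2)]
stream, final = toggle_stream(edges)
assert final == {(0, 1)}
```

The upside noted in the discussion is that this stream feeds straight into the algorithm; the downside is that the surviving graph may be much sparser than the original multigraph.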
- Ahmed: Fixed k-connectivity to not copy 1 graph, and instead use k parallel instances
- Ahmed: Memory consumption estimates in k-connectivity paper were very accurate (wow!). Time consumption for the skeletons is very bad (but in line with our calculations). A k-skeleton creates k instances of the graph sketch, finds a spanning forest on the sketch, removes those edges from all other sketches, and repeats.
- David: So we would also see a k-fold increase in stream ingestion?
- Ahmed: Yes, but actually no. Instead of creating k sketches we use one sketch and just copy-construct it when we iterate. Somehow this still passes our tests, even though this is super not kosher?
- David: This is a bruh moment. We can't do that. Try profiling again without optimization.
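The k-skeleton procedure Ahmed describes (k independent instances; peel a spanning forest from one, delete its edges from the rest, repeat) can be sketched as a toy with plain edge sets standing in for graph sketches; everything here is illustrative, not the project's implementation:

```python
class DSU:
    """Union-find, used to extract a spanning forest."""
    def __init__(self, n):
        self.p = list(range(n))
    def find(self, x):
        while self.p[x] != x:
            self.p[x] = self.p[self.p[x]]  # path halving
            x = self.p[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.p[ra] = rb
        return True

def spanning_forest(n, edges):
    dsu, forest = DSU(n), []
    for u, v in edges:
        if dsu.union(u, v):
            forest.append((u, v))
    return forest

def k_skeleton(n, edges, k):
    """k independent 'sketches' (here: k copies of the edge set).
    Peel a spanning forest from instance i, delete its edges from all
    later instances; the union of the k forests is the k-skeleton."""
    instances = [set(edges) for _ in range(k)]
    skeleton = []
    for i in range(k):
        forest = spanning_forest(n, sorted(instances[i]))
        skeleton.extend(forest)
        for inst in instances[i + 1:]:
            inst -= set(forest)
    return skeleton
```

The toy also makes the ingestion discussion concrete: each of the k instances must receive every stream update independently, which is where the k-fold ingestion cost comes from (and why copy-constructing a single sketch k times is not a valid shortcut).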
- Ahmed: Kronecker product graphs look interesting. Kronecker products of bipartite graphs are disconnected. Kronecker exponentiation mimics the evolution of power-law networks. We may be able to use this to generate realistic graph streams.
- Abi: The hosts dataset is absolutely massive. Looking at a small 10GB sample, it's even worse. The vast majority of edges are duplicates. We can probably store and process the dataset piece by piece to investigate further.
- David: Set aside a chunk of the drive (1TB) to download some of the graph and process it.
- Victor: Pushed some updates to the FastBufferTree repository to notify post-processors of tasks. Pushed updates to the StreamingGraphAlgo repository to process updates asynchronously
- Evan: FastBufferTree integration with StreamingGraphAlgo is done! Tests are passing, experiments are promising. Showing a 2+ times speedup (vs. 4 times slower for Toku)
- Evan: Even without parallelism it's looking very good.
- Evan: We may have made our buffer tree so good that it's being bottlenecked by the sketch update time. If the leaves get full, we're kind of screwed
- David: Maybe we can think about dumping this update queue to disk, but that's a problem for another time.
- Tyler: Wants some experiments to benchmark Spark stream ingestion against.
- Kenny: Has a single-threaded version of this benchmark.
- Victor: Has a multi-threaded version of this benchmark.
- Ahmed: Fixed PR issue: turned out to be a boost version incompatibility problem with the function signature of stoer_wagner_min_cut
- Abi: Learning and experimenting with xgt, the library used for processing the real world data set.
- Ahmed: Fix bug in implementation and look at ways to generate dense graphs. Try to get plots showing there is a k-factor blowup.
- Abi: Talk with Tyler to get the SWIG stuff resolved, and look at the networks dataset and preprocess it to make sure it's not a multigraph, and is also dense.
- Victor: Fix bug/linker error with FBT integration with Evan's help. Talk about some OpenMP-specific stuff with Abi.
- Evan: Help Abi with distributed stuff
- Kenny: Try and take a look at delta idea. Cut down on the bucket size by trying with 32-bit hash.
- David: We should take a look at using more "real-world" datasets. Including power-law sets, network graph from Los Alamos.
- Evan: can probably clean the data with scripts he's written
- David: but we need to be careful about multiple edges.
- Ahmed: Haven't done much for k-connectivity. Looked at making bipartite graph algorithm more memory efficient. Thinking about enabling a random walk on the graph using a small amount of space. If we run many of these random walks without encountering an odd loop, we return "true" for bipartiteness.
- David: This may be very hard to do. There's a lower bound for simulating random walks of N*sqrt(T), where T is the length of the walk.
- Kenny: Continuous graph testing is done. All that's left to do is run the tests.
- Kenny: We should look into using unique pointers instead of `new`/`malloc`.
- David: Maybe this is something we look into after a paper submission.
- Evan: No progress on the buffer tree ☹️. But it's very close to completion. All that's remaining is to integrate the streaming graph algo with the buffer tree implementation.
- David: This is important and we should get this working sooner rather than later.
- Victor: OpenMP PR has been up. Kenny please take a look.
- Victor: Post processing is moving on without serialization. It looks like the new algorithm is roughly 5% (?) faster and almost ready for review.
- David: Shelve this for a few days to integrate the buffer tree into current repo.
- Paper draft shared with everyone on overleaf
- Kenny: Looking at doing some preliminary tests. Working on buffering edges (without the buffer tree) and on adding the ability to verify at multiple points during this testing process.
- Victor: Working on OpenMP. Thinks there's something to false sharing within the sketches. Has tried putting all the sketches in a vector, but this made single-sketch updates very slow.
- Ahmed: Has k-connectivity working.
- Evan: Small progress on buffer tree.
- Paper draft coming soon
- Ahmed: bipartite matching profiling is now on a wiki page
- Ahmed/Kenny/Victor: memory leak in supernode/graph, fixed in master
- Kenny: Graph profiling, continuous sketch test. Quarter repetitions looks promising for the size of graphs we'll be working with
- Abi: Spark still being frustrating
- Congratulations Kenny on 3rd place at ICPC greater NY regional!
- Alex: SplinterDB access is just waiting on bureaucratic stuff.
- Ahmed: Finished sparse and cycle unit tests for bipartiteness testing: both pass. Practically done with memory profiling - just need a better computer.
- Victor: Experiments indicate that a minimum of 10^5 edge updates are needed in our buffers to start experiencing full benefits of parallelism. We've likely almost milked the parallelism cow dry at this point, barring a suggestion Abi made (see deadlines).
- Abi: fixed a few serialization issues after Tyler's work - still need to fix some more.
- Evan: finished multi-level implementation of buffer tree.
- Feel better Ahmed!
- Ahmed: having an issue with gdb and the CMake system.
- Evan: Suggests a fix involving "cleaning" the build directory before rebuilding.
- Kenny: Did some experiments with half/quarter-repetitions. It seems that the larger the sketches get, the fewer failures we see.
- David: Talked with the [expert], consider the "test-until-failure" model. As we insert edges into the sketch, query the sketch.
- Victor: Some experimental results from the OpenMP implementation.
- Kenny: These results may be coming from the fact that we're using native (32bit) for experiments instead of full-length (64bit).
- Victor: Will re-run these tests with OpenMP clock and 64bit types.
- Victor: There are concerns about the size of our `bucket.a` value and the size of inputs we're processing. If we have roughly 4 billion nodes, we'll hit the 64-bit integer limit for sketch indices.
- David: This may be an issue down the road. Let's worry about this when we get there.
- Evan: Buffer tree impl is getting somewhere. We're seeing roughly 2 million insertions per second already (on Evan's personal SSD machine)
- David: We want to eventually get to the speed of disk, or some log factor off of that bandwidth.
- Evan: Next step is to get multi-level buffer tree up and running. Thinking about allocating all of the size for the data structure up-front instead of dynamically.
- Abi: Still having issues with SWIG. Thankfully no issues with serialization anymore (w00t).
- Ahmed: finished bipartite stuff
- Alex: Splinter still waiting for bureaucratic issues. We will need to sign some stuff (we -- Martin, Michael, and some of the students)
- Statistical Testing: Probably doesn't need to be very strict for papers. Although, Victor is worried that perhaps we might not find bugs if we aren't strict with our tests.
- Kenny: has written basic hypothesis testing and has noticed that default bucket repetitions give way better than 1/n failure rates. This is promising for trimming down repetitions.
- Victor:
- External Memory post-processing -> Sat down with Abi to discuss serialization issues and thinks he's found a solution perhaps.
- Sketch-Level Parallelism -> Is hopeful that openMP will be good for us. Made a mistake in setting up the parallelism which is why he was concerned last meeting.
- Evan: Did some more tests with Toku. Queries + deletions are killing us especially when tree is big.
- False Sharing -- issue with 2 variables in the same cache line but this shouldn't be an issue for us because the data that each thread is operating on is much bigger than 64 bytes.
- Why do we need distributed? In-memory will fit better, but an alternative could be external memory. Disk bandwidth could be a bottleneck for both an external setup and ours; we only require one scan while they require more, which may be a significant runtime factor. However, our algorithm is bottlenecked by CPU, not disk, so a distributed setup can help us until we reach disk bandwidth.
- Abi: Still bashing our heads trying to fix serialization issue with Boost.
- Victor: Can run into issues with OpenMP and currently having issues with serialization. Working with Abi to get this resolved.
- Evan: Take the Toku system and run long experiments on it. Pitch: We need to use our own buffer tree or modify Toku/Splinter.
- Ahmed: Ask Victor one question and implement code to count the number of connected components
- Kenny: Work on statistical tests and verify that they can be replicated/reproduced easily.
- David:
- Make data structure smaller. Make constants as small as possible.
- Optimizing ingestion part of stream (pre-processing). Minimizing update cost whether that means using faster hashes or using the hardware more efficiently.
- Write-optimized version of data structure
- Distributed approach to making things faster.
- Lab machine is back up
- Abi: SWIG is still annoying. Tried to add serialization on the C++ side, but no luck. StackOverflow suggests PyBind? Talk with Tyler about this.
- David: Does Spark GraphX need enough combined memory across the workers to hold the entire graph? Don't think so, the RDD sits on disk.
- Victor, Kenny, Evan: Multithreading plans: supernode-level parallelism, or sketch-level parallelism.
- Assuming someone running our code will have a decent number of cores, lab machine is 20
- Sketch level: there are log(n) sketches per supernode, if n=1e6, log(n)=20, we can do 1 core per sketch
- Supernode level: whenever we get a batch of updates from toku, send that batch to a worker
- Problem: Adversarial graph: a bunch of updates all with one common endpoint hammers on a single supernode. Probable solution: instead of acting on the supernode directly, calculate a delta, then apply the delta to the sketch afterwards in one locked update.
- Performance: At some point, we reach a limit where adding more threads doesn't increase performance.
- Smallest case (1000 nodes): more threads decreases performance, could be due to locking overhead.
- Toku being a problem: queries are a source of locking, aren't cheap. Insertions and flushes can block each other.
- Might be time to switch to other write-optimized data structure.
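The delta idea above can be sketched as follows. This is an illustrative toy, not the project's code: the names (`Supernode`, `apply_delta`) are hypothetical, and a real l0-sketch bucket holds more than a single XOR accumulator. The point is that each worker computes its delta lock-free and holds the supernode's lock only for one short merge, which is safe in any order because XOR commutes.

```python
import threading

class Supernode:
    # Toy "supernode": a few XOR-accumulator buckets plus a lock (hypothetical layout).
    def __init__(self, n_buckets=8):
        self.buckets = [0] * n_buckets
        self.lock = threading.Lock()

    def apply_delta(self, delta):
        # Locked merge: XOR is commutative, so deltas combine in any order.
        with self.lock:
            for i, d in enumerate(delta):
                self.buckets[i] ^= d

def worker(supernode, batch):
    # Compute a delta locally, without holding the supernode's lock...
    delta = [0] * len(supernode.buckets)
    for update in batch:
        delta[update % len(delta)] ^= update
    # ...then do one short locked update at the end.
    supernode.apply_delta(delta)

# Adversarial case: every batch hammers the same supernode.
sn = Supernode()
batches = [list(range(b * 100, (b + 1) * 100)) for b in range(4)]
threads = [threading.Thread(target=worker, args=(sn, b)) for b in batches]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Regardless of thread interleaving, the final bucket contents equal a sequential pass over all updates.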
- David: Building on top of the connected components algorithm:
- There's some number of connected components in original graph.
- We can do some sort of transformed graph, and find some number of connected components in transformed graph.
- If transformed graph has double the number of connected components of original graph, then original graph is bipartite.
- Transformation:
- Double each original node `u` into a `u_red` and `u_blue` pair.
- For each original edge `(u,v)`, connect `u_red` to `v_blue` and `u_blue` to `v_red`.
- Note that this transformation can be done edge by edge, so we can do it while streaming.
- Alex: Splinter:
- If someone wants to use Splinter, some form signing has to happen about who owns what code.
- If everything goes well, could start using by next week.
- Abi and Tyler talked briefly about fixing serialization issues in SWIG. How to serialize SWIG objects in pickle is an issue.
- Evan seeing 25% improvement with Toku.
- Kenny doesn't think vectorization will help much, but we might see a small improvement from it. Because the trailing zeros of a hash determine the updates, we often update only a few buckets. We could see a small improvement if we used vectorization inside a bucket for updates, doing one vector operation instead of two XORs.
- Speeding up hashing through AVX? xxHash3 better than xxHash64.
- Bucket that contains everything is duplicated log(n) times. Theoretically should be okay to collapse into a single bucket shared amongst the log(n) bucket 'groups'?
- David has an idea of how to shrink datastructure a little bit more.
- Victor has a single-machine, single-thread implementation of the buffer tree working. Two ways to do I/Os: 1. the filesystem, 2. throw up our hands and let the OS handle paging. Wants to check this on the lab machine.
- Ahmed: Proof updates. David says that 2-wise independent hash functions should be good enough and should make the proof easier. The idea is to fix the current value of c and bound the probability that the bucket isn't good but looks like it is. The trick is to consider the penultimate c value.
- Trillion edge graphs in 2 hours with external memory graph algorithms. Is this because of super special nice memory which allows them to process super fast?
- Our algorithm is good in different datasets than what other folks run their experiments upon. We need the right datasets: we need dense, big graphs. If we run on their benchmarks we will run poorly.
- Abi: "short" fix for serialization issues, which consisted of adding methods to the Python side, proved a bust.
- Evan: 30% speed-up on Toku for medium sized graphs with latest xor hashing update, need a bit more time to incorporate kenny's most recent changes to the batch updates.
- Kenny: Unsure if vectorization will result in an actual speed-up. Preliminary experiments on his machine actually reveal a marginal slow-down (thousandths of a second). Seems attention should be shifted to vectorizing the hashing for set containment instead.
- Victor: Post-processing still a work in progress. Need to devise an I/O framework to begin integrating other code. The repo needs some restructuring to reflect our most recent developments in sketching.
- Ahmed: (temporarily) finished surveying existing CC implementations in external memory. The experimental evidence does not reveal relevant competition as clearly as hoped. Attempted to formalize proof of our current XOR hashing which defines "a" with XORing as well, but unsure how 2-universality of hash family is sufficient.
- David: Our progress is good, but we need to be faster and smaller if we want to clearly beat our competition.
- Ahmed: 2 external graph processing models are pretty good, in I/O. CLIP is a better framework, experimentally. GraphChi is a good implementation within an existing framework with good complexity (log(V/M)*SORT(E)).
- Kenny: Reusing generated hashes for each "column" with the same guess of \mu gives us a good speedup, which grows smoothly.
- David: This looks very good, and implies our impl now grows with logN, as opposed to log^3(N) or log^2(N). But we can maybe be faster with vectorized operations.
- Alex: We can probably write everything with AVX2 in mind, since the tech is like 7 years old now. Since our new lab machine has AVX512, we can have preprocessor directives to detect it and write code to take advantage of that.
- Victor: Failing tests are fixed. No changes were necessary for graph or sketch; it was a problem with the tests themselves.
- David: Since we probably will have access to SplinterDB, homebrewing a WOD is a second-order concern. We should look at implementing the Boruvka portion now.
- Evan: Most of the time spent is in bucket::contains.
- Ahmed: Figure out the lower bound for I/O complexity of the external graph algorithms explored in the survey paper.
- Kenny: Implement the suggestion based on trying fewer guesses that should result in having log N less buckets
- Abi: Still working with Tyler to integrate SWIG and Spark without issues.
- Victor: Chase down failing tests.
- Evan/Victor: there's still a bug with a graph test, edges that get inserted and later deleted cause problems. It passes for xor hashing (twice in a row during the meeting), likely a problem with sketching code.
- Abi: python didn't like interop, didn't know how to serialize SWIG objects.
- Why is Toku faster? Better locality? Do we want to optimize for cache instead of I/Os?
- Kenny: hashes are probably our main bottleneck. For each bucket there's one hash to calculate bucket seed, for each contains call there's one hash. Batching eliminates bucket seed calculation.
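For illustration, here is the trailing-zeros mechanic described above, with a stand-in hash (blake2b, not the project's xxHash): an update reaches bucket levels 0 through tz(hash), so on average only about 2 levels are touched, since P[level >= k] = 2^-k.

```python
import hashlib

def tz(x):
    # Number of trailing zero bits of a 64-bit value.
    return (x & -x).bit_length() - 1 if x else 64

def levels_touched(update, seed=0):
    # Stand-in hash for illustration; the real code uses xxHash.
    h = int.from_bytes(
        hashlib.blake2b(update.to_bytes(8, "little") + seed.to_bytes(8, "little"),
                        digest_size=8).digest(), "little")
    return tz(h) + 1  # update touches levels 0..tz(h)

counts = [levels_touched(u) for u in range(1, 10001)]
avg = sum(counts) / len(counts)  # empirically close to 2
```

This is why batching helps: the per-update hash work dominates, while the bucket writes themselves are few.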
- Does a sketch fit in cache? For 1000 nodes, each individual vector sketch has length 1000000, and log_2(1000000) ≈ 20, which means its size is ~6.4k bytes. Each node has log(1000) ≈ 10 sketches, so ~64k per node. Fits in L1 cache.
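A quick back-of-envelope check of the arithmetic above; the ~6.4KB-per-sketch figure is taken from these minutes rather than derived:

```python
import math

n = 1000
vector_len = n * n                           # characteristic vector: one entry per node pair
levels = math.ceil(math.log2(vector_len))    # ~20 levels per sketch
sketch_bytes = 6.4e3                         # ~6.4KB per sketch (figure from the minutes)
sketches_per_node = math.ceil(math.log2(n))  # ~10 sketches per node
per_node_bytes = sketch_bytes * sketches_per_node  # ~64KB per node
```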
- Evan: Note: Setup buckets with powermod took a while.
- Evan: Now that we have these space/time reductions, do we still need distributed? Most likely, x16 speedup and x3 space reduction probably doesn't reduce millennia to hours.
- David: Next thing to think about for sketching: there's one hash call for each contains call, maybe could make contains calls not independent.
- Ahmed: Graph survey paper has various inconsistencies in their tests and size of memory given. No data on I/Os. Not comprehensive.
- Evan - Ran toku implementation on lab machine 3-4 times averaging a 60 minute runtime. Curiously, running the same test on a local virtual machine only took 16 minutes.
- Kenny - Removed "delta" component of edge updates from sketch - only the indices are stored in the buckets now.
- Abi - Experiencing difficulties with Python when attempting to integrate SWIG and Spark. In particular, Spark does not know how to serialize SWIG objects.
- Ahmed - Finished presenting deterministic external memory connected components algorithm.
- Alex: Most likely can use SplinterDB. Will be able to know for sure by "next week". If we can, we'll have the source code available to us. It's unknown if we can release modifications open-source.
- Victor: Prototype skeleton up here.
- David: We should think about pushing buffer metadata to disk, since theoretically we don't have O(n) memory. But this is fine for now.
- Evan: Suggests having `buffer_control_block` at the end of the file-backed DS on disk. Read from the end of the file to access.
- Kenny: The profiler is slowing down the testing of the sketch code. The largest bottleneck for sketching is `contains()`. (Updated on page)
- David: We want to know a rough benchmark of how much overhead is actually introduced by the profiling.
- Kenny: Updated invalid test data and now failure rate is consistent with what we had before.
- Abi: Distributed code to create and update sketch in parallel. (Interop done, Spark roughly 50% done.)
- Evan: Toku is "sooper fast" compared to no write-optimization or batching. Roughly 10x speedup (Updated on page). Next step is to do this same experiment for xor-hashing.
- David: This is a great result, but it's super surprising. Can we figure out why exactly we're doing so well?
- David: We should compare ourselves to the "state of the art" in graph streaming. Ultimately, we need to know what targets to hit so we're better.
- Kenny -- Additionally should look at failure rates, and evaluate the tests (should they change)
- Abi and Tyler -- Working on first prototype of spark system which interfaces with sketching code.
- Evan -- We should look into a Toku to spark interface (via a file?)
minutes taken by Abi
- D: Should set deadlines for tasks that people are doing, and either deadline should be met, or we should resolve issues with meeting such deadlines.
- Ah: EM Connected components algorithm simulates a PRAM variant of the algorithm that is internal memory. Refers to list ranking algorithm but doesn't explain why the algorithm is needed. Algorithm achieves the lower bound for EM connected components.
- K: Still working on profiling performance for David's XOR sketching algorithm.
- V: Experiment we want to run: compare how our non-write-optimized algorithm performs versus Toku; two executables for this benchmark are on the GitHub, handing off to Evan. Without any optimization, our algorithm on an input of 1000 nodes (10^6 updates) takes 16 min.
- D: Old version or XOR technique?
- V: Old version.
- D: XOR technique seems to fail sampling more often than it should. Need to fix before we test with graphs. How do you compute the hashes? Kenny should check failure rate of the hashes.
- V: Still trying to set up the homebrewed write-optimized data structure (buffer tree for now).
- Abi/T: Have a Python program that takes a batch of edges, and uses interop to call our C++ sketch update code to return a SWIG-wrapped sketch object. Then parallelize this stuff using Spark.
- D: To distributed team: We also need to worry about concurrency issues for postprocessing.
- Alex: The Toku code is user-space BeTree code, and BetrFS is an in-kernel file system implementation of the Toku code. Can potentially get bottlenecked by CPU or concurrency/thread issues.
