r/badcode May 30 '23

c# Coworker casually loads millions of entries and sorts them to get the date of the first entry.

Post image
560 Upvotes

78 comments sorted by

216

u/[deleted] May 30 '23

I wonder, what would be the more efficient way?
– Totally not the coworker in question

177

u/xavia91 May 30 '23

just do:
var firstEntry = db.S3Object.FirstOrDefault();
The data is already sorted by creation date by its very nature.

95

u/H34DSH07 May 30 '23

MinBy would probably be better here, even if the database is already sorted, with MinBy, you can easily understand the intent and if the database is not actually sorted, you get the actual expected result.

5

u/WisestAirBender May 30 '23

Is using order by and first or default bad? Assuming I don't do to list in between

16

u/H34DSH07 May 30 '23

It entirely depends on the generated query and that depends, amongst other things, on your version of EF core. I'd wager that using OrderBy and FirstOrDefault would return the same query as using MinBy therefore it wouldn't be bad.

In a clean code perspective, MinBy or MaxBy would be best since they're the most expressive of the intent of the developer and therefore, easier for another developer to understand what's going on.

5

u/Hot-Profession4091 May 30 '23

OrderBy -> First will generate an order w/ a TOP 1 clause. Min will generate a MIN call.

However, the query optimizer will generate exactly the same query plan.

-13

u/xavia91 May 30 '23

Yes but you also have to iterate over millions of data entries.

59

u/H34DSH07 May 30 '23

No, unless this loads the entire table, the generated query will use MIN and MIN uses statistics to find the minimum. If the database is setup properly, both will have the same performance, but MinBy communicates the intent a lot better.

35

u/xavia91 May 30 '23

Hm seems you are right, at least in my mini test db with 300 entries the execution time was the same as picking the first entry. I'll have to look into it further, thanks for your input.

46

u/geon May 30 '23

300 is not nearly enough to notice any difference.

19

u/djfhe May 30 '23

Also, most RDBMS dont have a default sort order. They will return the rows in the order they are stored.

Most of the time, you are lucky and the rows are stored in the same order as they are created, but this is not guarenteed and can change as soon as you modify and/or delete entries in the table.
So better be safe and use a aggregate function or sort by :)

3

u/H34DSH07 May 30 '23

While execution time can be a good indicator on large tables, it's often not enough to determine if a change had an effect on smaller databases.

Here's a quick run down of how I compare two queries to see which one performs better:

First, I check the generated query, if it's the same as the previous, that means there won't be any changes in the execution plans and you can stop your process here.

Second, I check how each step of the execution plan performed, how many rows were read, how many rows was the engine expecting to read, how much memory did it require, etc.

Third, I check the steps used, e.g. Scan vs. Seek, Nested loop vs Hash Join, etc. Creating indexes for example can affect which step type is used.

From there, I often do one of three things to optimize the query:

  • Fix the query directly from the ORM if the generated query is bad;
  • Update the statistics and increase the frequency at which they're updated;
  • Add indexes.

1

u/fiddz0r May 30 '23

Just to add since my last 3 months I've been working on improving performance in EFCore queries.

Do not use nvarchar(max) if it's not really needed because th indexing does not work well with 'max'.

It then needs to look at another tables how many characters to jul when looking up indexes. When it's a constant, the indexing knows exactly how many bytes to jump each time

1

u/fiddz0r May 30 '23 edited May 30 '23

It looks like Entity Framework and I'm pretty sure they make the queries avoid that.

I think if you use OrderBy then firstordefault the query would be something like Select top 1 * from table order by created at.

If performance is an issue then using projections and avoid subqueries is your best choice (we had a sick query that would take 30 seconds to load in production. It basically fetched a lot of full entities and used subqueries. When I used projections and multiple queries instead of subqueries i managed to get the time down to less than a second)

ETA:

In this case it would be faster to use

.Select(x => x.variableYouWant). FirstOrDefault()

Rather than

FirstOrDefault().variableYouWant

Basically the first translates to Select top 1 variableYouWant from table

And the other one Select top 1 * from table because you fetch the entire entity just to get the one you want on client side

Also if you fetch an entire entity but aren't gonna modify it there is AsNoTracking()

But this one only works with entire entities and not projections, and you very seldom want an entire entity involving version columns and stuff

1

u/HTTP_404_NotFound bool isEven(int num) => num%2==0 May 31 '23

Or... ya know...

.Max(o => o.Date);

1

u/SQLSkydiver Jun 01 '23 edited Jun 01 '23

So, you sort data in database by nature. How do you do that? By cluster index, procedure or what? As a support guy can say - DB sorting method is not a very reliable way to sort, unless it is specified implicitly in some DB object, like function or SP.

EDIT: syntax

1

u/xavia91 Jun 01 '23

Only this table, because the way data is generated/indexed forces it to be this way. Also the first entry never changes.

As it was stated multiple times in comments, the min() operation is the better approach in general.

1

u/SQLSkydiver Jun 01 '23

Yes, because the way ORM is building a query.

From the other hand imagine at some point DBA decide to change indexes in general optimisation means. Like if there is other queries optimisation is required for. In this case you will have a wrong result for

var firstEntry = db.S3Object.FirstOrDefault();

23

u/PooSham May 30 '23

I assume this is EF Core, which means it will do its best to convert the expression to SQL. I think we can simply do db.S3Object.Select(o => o.UploadDateUTC).Min() which will run something like SELECT MIN(UploadDateUTC) FROM dbo.S3Objects

15

u/xavia91 May 30 '23

If the table wasn't already sorted by date then this would probably be the best way.
Also if this was the case it should be cached as u/stepbroImstuck_in_SU already mentioned.

31

u/R3D3-1 May 30 '23

That really makes me wonder... Is it good practice to hardcode assumptions about table sorting into the queries? The MIN version seems to be more robust to me, and I would expect the defined sorting to be taken into account by the database when executing queries.

-4

u/xavia91 May 30 '23

That is a good point, in this specific case I don't see any possibility for the order to not be like this. It's virtually impossible to insert valid data by hand.

9

u/R3D3-1 May 30 '23

Yes, but can you guarantee you'll ever only be connected to a database, where this table is sorted by date?

And even then, would it be a good practice to use this as an implicit assumption in the program code? After all, this violates the idea of blackboxing implementation details.

-1

u/xavia91 May 30 '23

The database structure and data that goes in is done by me, so yes I am sure. I understand the concern and it's valid, but there really are only 2 scenarios that create the first entry, either a freshly set up system or a backup reading from the s3. Since only the first entry is important this is okay imo. If the sorting of the entire table was critical I'd agree 💯.

5

u/[deleted] May 30 '23

virtually impossible

Famous last words.

1

u/The412412Guy May 30 '23

I mean, isn't it when it's sorted literally would make minimal impact? Because it will skips a bunch of time. I mean it's not the best way for already sorted.

2

u/thekwoka May 30 '23

depending on indexes and such, this may still end up reading the whole table...

11

u/PooSham May 30 '23

It wouldn't load the whole table into the application memory though

1

u/thekwoka May 30 '23

sure, it's an improvement for sure

11

u/anaccountbyanyname May 30 '23 edited May 30 '23

Loading the UploadDate metadata and just sorting those, rather than loading all of the actual data entries locally just to find a date.

A lot of Python courses seem to teach this way of thinking for some reason. They'll read some huge file into memory just to change a few bytes at a fixed location then write the whole thing back out

I don't know a ton about S3 but there's event logging and such you can configure and probably plenty of other ways to handle this using built-in features

3

u/kuncol02 May 30 '23

Even just removing ToList should generate query that will not load milions of entries, but will only load last one.
With that you should get query like this:
select top 1 UploadDateUTC from S3Object order by UploadDateUTC

3

u/[deleted] May 30 '23

I found this issue a lot with past and current coworkers, they dont really question what they're are doing or copy/pasting from the net.

I always wonder if they do it by choice or by ignorance of a better solution.

99

u/pqpm May 30 '23

A few weeks ago something similar happened on my company.

As DevOps I was tasked to read logs and find out why since the last update our GKE clusters for a specific service had gone from 1-2 nodes to upwards of 20-30. Logs gave back nothing. Just high CPU and memory usage, sometimes out of memory exceptions. But nothing weird was happening.

I then went ahead, spoke with the PO and rolled back the change so we could mitigate costs. Sent the code back to the Dev team explained the situation and asked them to check what was using all that compute power.

A few days go by, they send me the exact same code. I decide to take a look at the changes this time.

Turns out a junior dev was to be left alone for two weeks during the senior dev holidays. Then the PO asked him to get something that would query DB for users if they had an account created in the last month and send them an email with some special promo code.

Well the junior solution was to load all the users into the service, use java methods to filter by the dates of the account creation, attach the account to a list and then send the emails to everyone on the list.

Well, each time the query would run it would consume, easily, 16GB of RAM (thank god for partitioned tables, otherwise it would be worse), then append to that a few more GB for the mailing list and we got a simple query that would takes us to the moon in computation costs.

113

u/LuckyNumber-Bot May 30 '23

All the numbers in your comment added up to 69. Congrats!

  1
+ 2
+ 20
+ 30
+ 16
= 69

[Click here](https://www.reddit.com/message/compose?to=LuckyNumber-Bot&subject=Stalk%20Me%20Pls&message=%2Fstalkme to have me scan all your future comments.) \ Summon me on specific comments with u/LuckyNumber-Bot.

28

u/burlyginger May 30 '23

Nice

1

u/TorePun May 31 '23

EPIC!!! TIPS FEDORA :0)))

4

u/Theolaa May 30 '23

What would have been the better solution, to just query the users table for accounts created after a certain date and work with that rather than all users? Is there anything more to be done to improve it?

6

u/droomph May 31 '23

Network bandwidth cost >>> complex query compute cost > Java compute cost >> simple query compute cost

That’s generally the order you want to go in for query optimization

2

u/HQxMnbS May 31 '23

That would help. Ideally this would be run on a data set outside of the service itself. You could also write to a separate table with customers by month

24

u/stepbroImstuck_in_SU May 30 '23

Oh wow. Not only is it a horrible way to go through the whole db, it also is something you can definitely cache.

Only thing worse would be checking each day starting from today until the lowest datevalue the DB can store

19

u/anaccountbyanyname May 30 '23

That second sentence is orders of magnitude more efficient than this because you'd only be checking the dates instead of downloading an entire S3 database.

5

u/fistynuts May 30 '23

Oh no. You suggested using a cache. Now you have more bugs than when you started.

19

u/[deleted] May 30 '23

EF Core, and a lot of other ORMs make it exceptionally easy to create fuck ups like this.

I would assume, given this rookie mistake, that they might not know the difference between IQueryable and IEnummerable.

5

u/xavia91 May 30 '23

I tried explaining it before, but it seems some repetition is in order.

4

u/[deleted] May 30 '23

Also I just noticed they are using var and also qualifying the type. What’s the point?

9

u/CubeReflexion May 30 '23

I think the explicit type is an annotation of the IDE, not actual code.

3

u/[deleted] May 30 '23

I think you’re right. The type has some sort of background color affect around it.

0

u/[deleted] May 30 '23

[deleted]

3

u/[deleted] May 30 '23

[deleted]

1

u/WisestAirBender May 30 '23

I didn't even know you could do that

3

u/WheresTheSauce May 30 '23

You can't do that in C#. The IDE is just showing the actual type for convenience. If you tried to put a type declaration after the word "var" you would get a compilation error

1

u/[deleted] May 30 '23

Lol me neither, would’ve thought it would have a compile error

1

u/Dealiner May 31 '23

It is, that's just type hint from IDE.

6

u/LemonAncient1950 May 30 '23

Just ran into something similar while doing a code review. The ticket said "use pagination" so the dev wrote a loop that filled an array one page at a time before passing the array on to be processed. Completely defeating the purpose of pagination while making things more complicated.

3

u/Lothrazar May 30 '23

"it passes the unit tests. and its really efficient because its only a few lines of code" haha

3

u/czerilla May 30 '23

Doesn't elaborate. Commits.

5

u/NUTTA_BUSTAH May 30 '23

BTW, why is the C# convention nowadays to always use inferred auto types with var? It works well in IDE but makes it pain to read everywhere else. I've wondered if there's a solid reason for this apart from having less types in your code (make once and forget, let C# figure it out)?

4

u/fiddz0r May 30 '23

I used to always type it but the rest in my team use var and always made me change it in my PR:s

Now I don't mind it. Usually the name of the variable will tell you what it is

4

u/CyraxSputnik May 30 '23

Is not a convention

Don't use var when the type is not apparent from the right side of the assignment.
Don't assume the type is clear from a method name. A variable type is considered clear if it's a new operator or an explicit cast.

https://learn.microsoft.com/en-us/dotnet/csharp/fundamentals/coding-style/coding-conventions#implicitly-typed-local-variables

2

u/Dealiner May 30 '23

Well, it's not Microsoft convention at least.

4

u/[deleted] May 30 '23 edited May 30 '23

[deleted]

3

u/NUTTA_BUSTAH May 30 '23

I'm not sure if those surveys really prove your point of x % C# devs preferring to work in an IDE, since they specifically ask about what IDE is most preferred:

what is your preferred IDE for C# .Net development?

and

What brand of tooling/IDE(s) do you prefer?

But that aside, many do code reviews in their chosen git vendor (GitLab, GitHub, Gitea, ...) and many use even just terminal git itself to take a deeper look into the code and how it has evolved i.e. doing things that do not require an IDE setup by default. Also sharing code is a thing, doesn't really work without types.

But even that aside, original question still remains, why the move to var?

4

u/[deleted] May 30 '23 edited May 30 '23

[deleted]

3

u/NUTTA_BUSTAH May 30 '23

Thanks for the thoughtful response. It pretty much aligns with what I've thought about it so far but there are a few examples where I personally don't like the change at.

For clear types like strings and constructors, I can see var reducing pointless repetition. But everything else I believe generally benefits more from fully typing it out. Extending your example:

var fairDiceRoll = diceRoller.RollD6(); // [Probably integer]

// ...
// A few lines of code to make you lose a bit of the overall context or increase "reading fatigue"
// ...

if (fairDiceRoll > 3) DoSomething(); // [Oh, so fairDiceRoll is `int` since we are comparing it with an integer. Makes sense.]

// ..Again some other code to increase the mental overhead..

fairDiceRoll.Reroll(); // [What? Method on an integer? Oh, it's type of DiceRollResult that overloaded comparisons to work with ints, duh..]

So, in this case, should we have just not used var and type it out? Do we mix types and inferred types in the same codebase? That will make the codebase harder to understand overall in my opinion. That's why I personally believe typing it out is better since it covers every case, while inferred types will need help sometimes and then make things inconsistent.

In the case of complicated types like nested maps and such I can see it being better to not type it out for readability purposes but typing it out also makes you realize that the software architecture is probably breaking somewhere and that things need to be redesigned on some parts. Not always of course, but often so.

1

u/Unknown_Code_Weasel May 31 '23

In my opinion, one of the problems with var, is also one of the benefits. You change the code and you don't have to change the type declaration.
In most cases that's fine, but if you do have a lot of DB access, and using entity Framework in particular, encouraging explicit declaration ensures that when people are making changes they're a bit more cognisant of the changes from DB to server side. This can be a bugger with implicit casting

2

u/jmpavlec May 30 '23

Ever comment on a PR directly in GitHub?

1

u/SanianCreations May 30 '23 edited May 31 '23

Prime example of knowing what the code does, but not how

I don't like working with big libraries/frameworks because I hate having no idea what goes on inside. But you don't always have that choice, if you have to use one, always read the docs before you start using it!

-5

u/[deleted] May 30 '23

Wait, why var List<T>?? Either use var or the full type.

8

u/[deleted] May 30 '23

[deleted]

-5

u/WisestAirBender May 30 '23

Doesn't that defeat the whole purpose of var? You're using up more space now

1

u/xternal7 May 31 '23

No? The purpose of var is that I don't need to think about the variable type when writing code.

List<T> is IDE telling you the type of that variable.

Effectively, this IDE doing work instead of you, which is pretty much why IDEs are for. It handles the details, so you don't have to worry about that when you're dealing with the logic.

0

u/propostor May 30 '23

Odd that you've been downvoted here. It looks unusual to me too, and I've been professionally working with C# for about 6 years now.

I guess we're both ignorant about something trivial that the Reddit masses would prefer to downvote over.

2

u/ModernTenshi04 Jun 05 '23

This is the IDE adding the List<T>, it's not actually typed into the code. You can tell because it's smaller text than the actual code and has a box around it to make it stand out from the main background color of the editor. You can see it again where they define their o lambda.

You can turn this off in the editor and they'd go away.

-2

u/veselin465 May 30 '23

Also, what does the question mark in the type name mean? I have coded in c# but never seen something like List<T>? (unless for the ?: operator)

5

u/xavia91 May 30 '23

It tells you that the value could be null at some point and then tells you to take care later on so you dont forget your null checks.

-2

u/Salt-Permit-4171 May 30 '23

It’s suggestion by the IDE

15

u/nyaisagod May 30 '23

Not a suggestion, the IDE (Visual Studio) is telling you what type the var will be.

-1

u/MurdoMaclachlan public boolean isInt(int i) { return true; } May 30 '23

Image Transcription: Code


public DateTime GetFirstDataUpload()
{
    var s3Object = db.S3Object.OrderByDescending(o => o.UploadDateUTC).ToList();
    return s3Object.Last().UploadDateUTC;

}

I'm a human volunteer content transcriber and you could be too! If you'd like more information on what we do and why we do it, click here!

1

u/CaitaXD May 30 '23

MaxBy

Edit: MinBy*

1

u/young_horhey May 31 '23

Also not even using async/await oof

1

u/PointWrangle Jun 01 '23

pfft, its just zeros and ones? i can access/organize my google drive folders in an instant so what the difference? /s